arXiv:1505.02729v1 [cs.LG] 11 May 2015


Sample Complexity of Learning Mahalanobis 
Distance Metrics 

Nakul Verma* and Kristin Branson^ 

Janelia Research Campus 
Howard Hughes Medical Institute, Virginia, USA 


Abstract 

Metric learning seeks a transformation of the feature space that enhances prediction quality
for the given task at hand. In this work we provide PAC-style sample complexity rates for
supervised metric learning. We give matching lower- and upper-bounds showing that the sample
complexity scales with the representation dimension when no assumptions are made about the
underlying data distribution. However, by leveraging the structure of the data distribution, we
show that one can achieve rates that are fine-tuned to a specific notion of intrinsic complexity for
a given dataset. Our analysis reveals that augmenting the metric learning optimization criterion
with a simple norm-based regularization can help adapt to a dataset's intrinsic complexity,
yielding better generalization. Experiments on benchmark datasets validate our analysis and show that
regularizing the metric can help discern the signal even when the data contains high amounts of
noise.


1 Introduction 

In many machine learning tasks, data is represented in a high-dimensional Euclidean space where
each dimension corresponds to some interesting measurement of the observation. Often, practitioners
include a variety of measurements in hopes that some combination of these features will capture
the relevant information. While it is natural to represent such data in a real space of
measurements, there is no reason to expect that using Euclidean ($L_2$) distances to compare the
observations will necessarily be useful for the task at hand. Indeed, the presence of uninformative
or mutually correlated measurements simply inflates the $L_2$-distances between pairs of
observations, rendering distance-based comparisons ineffective.

Metric learning has emerged as a powerful technique to learn a good notion of distance or a
metric in the representation space that can emphasize the feature combinations that help in the
prediction task while suppressing the contribution of spurious measurements. The past decade has seen
a variety of successful metric learning algorithms that leverage various attributes of the problem
domain. A few notable examples include exploiting class labels to find a Mahalanobis distance
metric that maximizes the distance between dissimilar observations while minimizing distances
between similar ones to improve classification quality (Weinberger & Saul, 2009; Davis et al., 2007),
and explicitly optimizing for a downstream prediction task such as information retrieval (McFee &
Lanckriet, 2010).

* email: verman@janelia.hhmi.org; corresponding author.

^ email: bransonk@janelia.hhmi.org

Despite the popularity of metric learning methods, few studies have examined how the
problem complexity scales with key attributes of a given dataset. For instance, how do we expect
the generalization error to scale, both theoretically and practically, as one varies the number of
informative and uninformative measurements, or changes the noise levels?

Here we study supervised metric learning more formally and gain a better understanding of how
different modalities in data affect the metric learning problem. We develop two general frameworks
for PAC-style analysis of supervised metric learning. Popular metric learning algorithms can be
cast as empirical error minimization problems in one of the two frameworks. The first
generic framework, the distance-based metric learning framework, uses class label information to
derive distance constraints. The key objective is to learn a metric that on average yields smaller
distances between examples from the same class than those from different classes. Some popular
algorithms that optimize for such distance-based objectives include Mahalanobis Metric for
Clustering (MMC) by Xing et al. (2002) and Information Theoretic Metric Learning (ITML) by Davis
et al. (2007). Instead of using distance comparisons as a proxy, however, one can also optimize for
a specific prediction task directly. The second generic framework, the classifier-based metric
learning framework, explicitly incorporates the hypothesis associated with the prediction task of
interest to learn effective distance metrics. A few interesting examples in this regime include the
work by McFee & Lanckriet (2010) that finds metrics that improve ranking quality in information
retrieval tasks, and the work by Shaw et al. (2011) that learns metrics that help predict
connectivity structure in networked data.

Our analysis shows that in both frameworks, the sample complexity scales with the
representation dimension for a given dataset (Lemmas 1 and 3), and this dependence is necessary in the
absence of any specific assumptions on the underlying data distribution (Lemmas 2 and 4). By
considering any Lipschitz loss, our results generalize previous sample complexity results (see our
discussion in Section 6) and, for the first time in the literature, provide matching lower bounds.

In light of the observation made earlier that data measurements often include uninformative or
weakly informative features, we expect a metric that yields good generalization performance to
de-emphasize such features and accentuate the relevant ones. We can thus formalize the metric learning
complexity of a given dataset in terms of the intrinsic complexity $d$ of the metric that reweights the
features in a way that yields the best generalization performance. (For Mahalanobis distance metrics,
we can characterize the intrinsic complexity by the norm of the matrix representation of the metric.)
We refine our sample complexity result and show a dataset-dependent bound for both frameworks
that scales with the dataset's intrinsic metric learning complexity $d$ (Corollary 7).

Taking guidance from our dataset-dependent result, we propose a simple variation on the
empirical risk minimizing (ERM) algorithm that, when given an i.i.d. sample, returns a metric (of
complexity $d$) that jointly minimizes the observed sample bias and the expected intra-class variance for
metrics of fixed complexity $d$. This bias-variance balancing algorithm can be viewed as a structural
risk minimizing algorithm that provides better generalization performance than an ERM algorithm
and justifies norm-regularization of weighting metrics in the optimization criteria for metric
learning.


Finally, we evaluate the practical efficacy of our proposed norm-regularization criteria with some
popular metric learning algorithms on benchmark datasets (Section 5). Our experiments highlight
that the norm-regularization indeed helps in learning weighting metrics that better adapt to the signal
in data in high-noise regimes.


2 Preliminaries


Given a representation space $X = \mathbb{R}^D$ of $D$ real-valued measurements of observations of interest,
the goal of metric learning is to learn a metric $M$ (that is, a $D \times D$ real-valued weighting matrix
on $X$; to remove arbitrary scaling we shall assume that the maximum singular value of $M$ is one, that is,
$\sigma_{\max}(M) = 1$)^1 that minimizes some notion of error on data drawn from an unknown underlying
distribution $\mathcal{D}$ on $X \times \{0,1\}$. Specifically, we want to find the metric

$$M^* := \operatorname*{argmin}_{M \in \mathcal{M}} \mathrm{err}(M, \mathcal{D}),$$

from the class of metrics $\mathcal{M}$ under consideration, that is,
$\mathcal{M} := \{M \mid M \in \mathbb{R}^{D \times D}, \sigma_{\max}(M) = 1\}$.
For supervised metric learning, this error is typically label-based and can be defined in multiple
reasonable ways. As discussed earlier, we explore two intuitive regimes for defining error.
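The scaling convention above (unit maximum singular value) can be illustrated with a minimal NumPy sketch. The function names `normalize_spectral` and `in_metric_class` and the tolerance are illustrative assumptions, not part of the original text.

```python
import numpy as np

def normalize_spectral(M):
    """Rescale M so that its largest singular value equals 1."""
    M = np.asarray(M, dtype=float)
    sigma_max = np.linalg.norm(M, ord=2)  # matrix 2-norm = largest singular value
    if sigma_max == 0.0:
        raise ValueError("the zero matrix cannot be rescaled to unit spectral norm")
    return M / sigma_max

def in_metric_class(M, tol=1e-9):
    """Check that M is a square matrix with sigma_max(M) = 1 up to a tolerance (hypothetical helper)."""
    M = np.asarray(M, dtype=float)
    is_square = M.ndim == 2 and M.shape[0] == M.shape[1]
    return bool(is_square and abs(np.linalg.norm(M, ord=2) - 1.0) <= tol)
```

Any nonzero square weighting matrix can be mapped into the class this way without changing the relative weighting of features.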


Distance-based error. A popular criterion for quantifying error in metric learning is by
comparing distances amongst points drawn from the underlying data distribution. Ideally, we want a
weighting metric $M$ that brings data from the same class closer together than those from opposite
classes. In a distance-based framework, a natural way to accomplish this is to find a weighting $M$
that yields shorter distances between pairs of observations from the same class than those from
different classes. Penalizing how often and by how much the distances violate these constraints
gives rise to the particular form of the error.

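Let the variable $z = (x, y)$ denote a random draw from $\mathcal{D}$ with $x \in X$ as the observation and
$y \in \{0,1\}$ its associated label, and let $\lambda$ denote how severely one wants to penalize the distance
violations. Then a natural definition of distance-based error becomes:

$$\mathrm{err}^{\lambda}_{\mathrm{dist}}(M, \mathcal{D}) := \mathbb{E}_{z_1, z_2 \sim \mathcal{D}} \left[ \phi^{\lambda}\big(\rho_M(x_1, x_2), Y\big) \right],$$

for a generic distance-based loss function $\phi^{\lambda}(\rho_M, Y)$ that computes the degree of violation between
the weighted distance $\rho_M(x_1, x_2) := \|M(x_1 - x_2)\|^2$ and the label agreement $Y := \mathbf{1}[y_1 = y_2]$ among
a pair $z_1 = (x_1, y_1)$ and $z_2 = (x_2, y_2)$ drawn from $\mathcal{D}$.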

An example instantiation of $\phi$ popular in literature encourages metrics that yield distances that
are no more than some upper limit $U$ between observations from the same class, and distances that
are no less than some lower limit $L$ between those from different classes (for some $U < L$). Thus

$$\phi^{\lambda}_{L,U}(\rho_M, Y) := \begin{cases} \min\{1, \lambda[\rho_M - U]_+\} & \text{if } Y = 1 \\ \min\{1, \lambda[L - \rho_M]_+\} & \text{otherwise} \end{cases} \qquad (1)$$

where $[A]_+ := \max\{0, A\}$.
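As a minimal sketch of the pairwise loss in Eq. (1), the following Python code computes the weighted squared distance and the clipped hinge penalty. The names `rho` and `pair_loss` are illustrative assumptions; only the formulas come from the text.

```python
import numpy as np

def rho(M, x1, x2):
    """Weighted squared distance ||M (x1 - x2)||^2."""
    diff = np.asarray(M, dtype=float) @ (
        np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    )
    return float(diff @ diff)

def pair_loss(rho_value, same_label, lam, L, U):
    """Clipped hinge loss of Eq. (1) for a single pair.

    Same-class pairs are penalized when the distance exceeds U;
    different-class pairs are penalized when it falls below L.
    """
    violation = rho_value - U if same_label else L - rho_value
    return min(1.0, lam * max(0.0, violation))
```

For instance, with the identity metric, `rho(np.eye(2), [0, 0], [3, 4])` is 25, so a same-class pair at that distance is penalized at the cap of 1 for `lam=0.1, U=1`.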

Xing et al. (2002) optimize an efficiently computable variant of this criterion, in which they look
for a metric that keeps the total pairwise distance amongst the observations from the same class
less than a constant while maximizing the total pairwise distance amongst the observations from
opposite classes. The variant proposed by Davis et al. (2007) explicitly includes the upper and lower
limits with an added regularization on the learned $M$ to be close to a pre-specified metric of interest
$M_0$.

^1 Note that we are looking at the linear form of the metric $M$; usually the corresponding quadratic form $M^{\top} M$ is discussed
in the literature, which is necessarily positive semi-definite.

While we discuss loss functions $\phi$ that handle distances between a pair of observations, it is easy
to extend to distances among triplets. Rather than having hard upper and lower limits which every
pair of the same and the opposite classes must obey, a triplet-based comparison typically focuses on
relative distances between three observations at a time. A natural instantiation in this case becomes:

$$\phi^{\lambda}_{\mathrm{triple}}\big(\rho_M(x_1, x_2), \rho_M(x_1, x_3), (y_1, y_2, y_3)\big) := \begin{cases} \min\{1, \lambda[\rho_M(x_1, x_2) - \rho_M(x_1, x_3)]_+\} & \text{if } y_1 = y_2 \neq y_3 \\ 0 & \text{otherwise} \end{cases}$$

for a triplet $(x_1, y_1), (x_2, y_2), (x_3, y_3)$ drawn from $\mathcal{D}$.
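The triplet comparison can be sketched in the same way. The function below takes the two weighted distances as precomputed numbers; the name `triplet_loss` and the argument order are assumptions made for illustration.

```python
def triplet_loss(rho_12, rho_13, labels, lam):
    """Clipped triplet loss.

    Penalizes the case where x1 is farther from its same-class partner x2
    than from the different-class point x3. Triplets that do not have the
    pattern y1 == y2 != y3 contribute zero.
    """
    y1, y2, y3 = labels
    if y1 == y2 and y2 != y3:
        return min(1.0, lam * max(0.0, rho_12 - rho_13))
    return 0.0
```

Only the ordering of the two distances matters here, which is why no absolute limits such as $U$ and $L$ appear.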

Weinberger & Saul (2009) discuss an interesting variant of this, in which instead of looking at
all triplets in a given training sample, they focus on triplets of observations in local neighborhoods 
and learn a metric that maintains a gap or a margin among distances between observations from the 
same class and those from the opposite class. Improving the quality of distance comparisons in local 
neighborhoods directly affects the nearest neighbor performance, making this a popular technique. 


Classifier-based error. Distance comparisons typically act as a surrogate for a specific
downstream prediction task. If we want a metric that directly optimizes for a task, we need to explicitly
incorporate the hypothesis class being used for that task while finding a good weighting metric.

This simple but effective insight has been used recently by McFee & Lanckriet (2010) for
improving ranking results in information retrieval problems by explicitly incorporating ranking losses
while learning an effective weighting metric. Shaw et al. (2011) also follow this principle and
explicitly include network topology constraints to learn a weighting metric that can better predict the
connectivity structure in social and web networks.

We can formalize the classifier-based metric learning framework by considering a fixed
hypothesis class $\mathcal{H}$ of interest on the measurement domain. To keep the discussion general, we shall assume
that the hypotheses are real-valued and can be regarded as a measure of confidence in classification,
that is, each $h \in \mathcal{H}$ is of the form $h : X \to [0,1]$. (One can obtain the binary predictions from $h$
by a simple thresholding at 1/2.) Then, the error induced by a particular weighting metric $M$ on the
measurement space $X$ can be defined as the best possible error that can be obtained by hypotheses
in $\mathcal{H}$, that is

$$\mathrm{err}_{\mathrm{hypoth}}(M, \mathcal{D}) := \inf_{h \in \mathcal{H}} \mathbb{E}_{(x,y) \sim \mathcal{D}} \Big[ \mathbf{1}\big[\,|h(Mx) - y| > 1/2\,\big] \Big].$$


We shall study how this error scales with various key parameters of the metric learning problem. 


3 Learning a Metric from Samples 

In any practical setting, we estimate the ideal weighting metric $M^*$ by minimizing the empirical
version of the error criterion from a finite size sample from $\mathcal{D}$.

Let $S_m$ denote a sample of size $m$, and $\mathrm{err}(M, S_m)$ denote the empirical error on the sample
$S_m$ (the exact definitions of $S_m$ and the form of $\mathrm{err}(M, S_m)$ are discussed later). We can then
define the empirical risk minimizing metric based on $m$ samples as
$M^*_m := \operatorname*{argmin}_{M \in \mathcal{M}} \mathrm{err}(M, S_m)$.
Most practical algorithms, of course, return some approximation of $M^*_m$, and thus it is important to
compare the generalization ability of $M^*_m$ to that of the theoretically optimal $M^*$. That is, how

$$\mathrm{err}(M^*_m, \mathcal{D}) - \mathrm{err}(M^*, \mathcal{D}) \qquad (2)$$

converges as the sample size $m$ grows.






















3.1 Distance-Based Error Analysis 

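Given an i.i.d. sequence of observations $z_1, z_2, \ldots$ from $\mathcal{D}$, we can pair the observations together to
form a paired sample $S_m = \{(z_1, z_2), (z_3, z_4), \ldots, (z_{2m-1}, z_{2m})\} = \{(z_{1,i}, z_{2,i})\}_{i=1}^{m}$ of size $m$,
and define the sample-based distance error $\mathrm{err}^{\lambda}_{\mathrm{dist}}(M, S_m)$ induced by a metric $M$ as

$$\mathrm{err}^{\lambda}_{\mathrm{dist}}(M, S_m) := \frac{1}{m} \sum_{i=1}^{m} \phi^{\lambda}\big(\rho_M(x_{1,i}, x_{2,i}), \mathbf{1}[y_{1,i} = y_{2,i}]\big).$$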
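A minimal NumPy sketch of this empirical error follows. It assumes consecutive observations are paired as $(z_1, z_2), (z_3, z_4), \ldots$ and, purely for illustration, instantiates the generic loss with the clipped hinge loss of Eq. (1); the function name and argument layout are assumptions.

```python
import numpy as np

def empirical_dist_error(M, X, y, lam, L, U):
    """Average pairwise loss over the paired sample (z1, z2), (z3, z4), ...

    X holds one observation per row and y the labels; a trailing unpaired
    observation is ignored. The loss is the clipped hinge loss of Eq. (1).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    m = len(X) // 2
    if m == 0:
        raise ValueError("need at least two observations to form a pair")
    x1, x2 = X[0:2 * m:2], X[1:2 * m:2]
    same = y[0:2 * m:2] == y[1:2 * m:2]
    diffs = (x1 - x2) @ np.asarray(M, dtype=float).T   # rows are M (x1 - x2)
    rho = np.sum(diffs ** 2, axis=1)                   # weighted squared distances
    violation = np.where(same, rho - U, L - rho)
    losses = np.minimum(1.0, lam * np.maximum(0.0, violation))
    return float(losses.mean())
```

Pairing consecutive draws keeps the $m$ pairs independent of one another, which is what a concentration argument over the sample relies on.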

Then for any bounded support distribution $\mathcal{D}$ (that is, each $(x, y) \sim \mathcal{D}$, $\|x\| \le B < \infty$), we
have the following convergence result.^2

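Lemma 1 Fix any sample size $m$, and let $S_m$ be an i.i.d. paired sample of size $m$ from an unknown
bounded distribution $\mathcal{D}$ (with bound $B$). For any distance-based loss function $\phi^{\lambda}$ that is $\lambda$-Lipschitz
in the first argument, with probability at least $1 - \delta$ over the draw of $S_m$,

$$\sup_{M \in \mathcal{M}} \big[\mathrm{err}^{\lambda}_{\mathrm{dist}}(M, \mathcal{D}) - \mathrm{err}^{\lambda}_{\mathrm{dist}}(M, S_m)\big] \le O\left(\lambda B^2 \sqrt{\frac{D \ln(1/\delta)}{m}}\right).$$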

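Using this lemma we can get the desired convergence rate (Eq. 2). Fix $M^* \in \mathcal{M}$, then for any
$0 < \delta < 1$ and $m \ge 1$, with probability at least $1 - \delta$, we have

$$\begin{aligned}
\mathrm{err}^{\lambda}_{\mathrm{dist}}(M^*_m, \mathcal{D}) - \mathrm{err}^{\lambda}_{\mathrm{dist}}(M^*, \mathcal{D})
&= \mathrm{err}^{\lambda}_{\mathrm{dist}}(M^*_m, \mathcal{D}) - \mathrm{err}^{\lambda}_{\mathrm{dist}}(M^*_m, S_m) + \mathrm{err}^{\lambda}_{\mathrm{dist}}(M^*_m, S_m) - \mathrm{err}^{\lambda}_{\mathrm{dist}}(M^*, S_m) \\
&\quad + \mathrm{err}^{\lambda}_{\mathrm{dist}}(M^*, S_m) - \mathrm{err}^{\lambda}_{\mathrm{dist}}(M^*, \mathcal{D}) \\
&\le O\left(\lambda B^2 \sqrt{\frac{D \ln(1/\delta)}{m}}\right) + \sqrt{\frac{\ln(2/\delta)}{2m}} \\
&= O\left(\lambda B^2 \sqrt{\frac{D \ln(1/\delta)}{m}}\right),
\end{aligned}$$

by noting (i) $\mathrm{err}^{\lambda}_{\mathrm{dist}}(M^*_m, S_m) \le \mathrm{err}^{\lambda}_{\mathrm{dist}}(M^*, S_m)$, since $M^*_m$ is empirical error minimizing on $S_m$,
and (ii) by using Hoeffding's inequality on the fixed $M^*$ to conclude that with probability at least
$1 - \delta/2$, $\mathrm{err}^{\lambda}_{\mathrm{dist}}(M^*, S_m) - \mathrm{err}^{\lambda}_{\mathrm{dist}}(M^*, \mathcal{D}) \le \sqrt{\ln(2/\delta)/(2m)}$.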

Thus to achieve a specific estimation error rate $\epsilon$, $m = \Omega\big((\lambda B^2/\epsilon)^2 D \ln(1/\delta)\big)$ samples
are sufficient to conclude, with confidence at least $1 - \delta$, that the empirical risk minimizing metric
$M^*_m$ will have estimation error of at most $\epsilon$. This shows that one never needs more than a number
of examples proportional to the representation dimension $D$ to achieve the desired level of accuracy.
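To make the scaling concrete, the sketch below solves the rate for $m$, assuming the upper bound has the form $\lambda B^2 \sqrt{D \ln(1/\delta)/m}$. The constant hidden in the O-notation is unknown, so the parameter `c` (defaulting to 1) is a placeholder assumption and the returned numbers are only meaningful relative to each other.

```python
import math

def sufficient_samples(D, eps, delta, lam=1.0, B=1.0, c=1.0):
    """Sample size at which c * lam * B^2 * sqrt(D ln(1/delta) / m) drops to eps.

    c stands in for the unspecified constant of the O-notation (assumption).
    """
    if not (0.0 < delta < 1.0) or eps <= 0.0:
        raise ValueError("need 0 < delta < 1 and eps > 0")
    return math.ceil((c * lam * B ** 2 / eps) ** 2 * D * math.log(1.0 / delta))
```

Doubling the representation dimension roughly doubles the result, while halving the target error roughly quadruples it.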

Since typical applications have a large representation dimension, it is instructive to study if such
a strong dependence on $D$ is necessary. It turns out that even for simple distance-based loss functions
like $\phi^{\lambda}_{L,U}$ (cf. Eq. 1), there are data distributions for which one cannot get away with fewer than
linear in $D$ samples and ensure good estimation errors. In particular we have the following.

Lemma 2 Let $A$ be any algorithm that, given an i.i.d. sample $S_m$ (of size $m$) from a fixed unknown
bounded support distribution $\mathcal{D}$, returns a weighting metric from $\mathcal{M}$ that minimizes the empirical
error with respect to distance-based loss function $\phi^{\lambda}_{L,U}$. There exist $\lambda > 0$, $0 < U < L$, such that
for all $0 < \epsilon, \delta < 1/64$, there exists a bounded support distribution $\mathcal{D}$, such that if $m \le \frac{D+1}{512\,\epsilon^2}$ then

$$P_{S_m}\big[\mathrm{err}^{\lambda}_{\mathrm{dist}}(A(S_m), \mathcal{D}) - \mathrm{err}^{\lambda}_{\mathrm{dist}}(M^*, \mathcal{D}) > \epsilon\big] > \delta.$$

^2 We only present the results for paired distance comparisons; the results are easily extended to triplet-based comparisons.


While this may seem discouraging for large-scale applications of metric learning, note that here
we made no assumptions about the underlying structure of the data distribution $\mathcal{D}$, making this a
worst-case analysis. As the individual features in real-world datasets contain varying amounts of
information for good classification performance, one hopes for a more relaxed dependence on $D$ for
metric learning in these settings. This is explored in Section 4.


3.2 Classifier-Based Error Analysis 


In this setting, we can use an i.i.d. sequence of observations $z_1, z_2, \ldots$ from $\mathcal{D}$ to obtain the sample
$S_m = \{z_i\}_{i=1}^{m}$ of size $m$ directly. To analyze the generalization ability of the weighting metrics
optimized with respect to an underlying hypothesis class $\mathcal{H}$, we need to effectively analyze the
classification complexity of $\mathcal{H}$. The scale-sensitive version of VC-dimension, also known as the
"fat-shattering dimension", of a real-valued hypothesis class (denoted by $\mathrm{Fat}_{\gamma}(\mathcal{H})$) encodes the
right notion of classification complexity and provides an intuitive way to relate the generalization
error to the empirical error at a margin $\gamma$ (see for instance the work of Anthony & Bartlett (1999)
for an excellent discussion).

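In the context of metric learning with respect to a fixed hypothesis class, define the empirical
error at a margin $\gamma$ as

$$\mathrm{err}^{\gamma}_{\mathrm{hypoth}}(M, S_m) := \inf_{h \in \mathcal{H}} \frac{1}{m} \sum_{(x_i, y_i) \in S_m} \mathbf{1}\big[\mathrm{Margin}(h(Mx_i), y_i) < \gamma\big],$$

where $\mathrm{Margin}(\hat{y}, y) := \begin{cases} \hat{y} - 1/2 & \text{if } y = 1 \\ 1/2 - \hat{y} & \text{otherwise.} \end{cases}$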
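The margin-based empirical error can be sketched for a single fixed hypothesis as below; the infimum over the hypothesis class is not computed here. The margin is taken as the signed distance of the confidence from the threshold 1/2 (positive when the thresholded prediction is correct), and the strict comparison with $\gamma$ is an assumption about the intended inequality. The names `margin` and `empirical_margin_error` are illustrative.

```python
import numpy as np

def margin(y_hat, y):
    """Signed distance of the confidence y_hat in [0, 1] from the threshold 1/2,
    positive when the thresholded prediction agrees with the label y."""
    return y_hat - 0.5 if y == 1 else 0.5 - y_hat

def empirical_margin_error(h, M, X, y, gamma):
    """Fraction of sample points whose margin under x -> h(M x) is below gamma.

    h is one fixed hypothesis mapping a vector to [0, 1]; minimizing this
    quantity over a whole hypothesis class is left out of the sketch.
    """
    M = np.asarray(M, dtype=float)
    X = np.asarray(X, dtype=float)
    flags = [margin(h(M @ x), yi) < gamma for x, yi in zip(X, y)]
    return float(np.mean(flags))
```

With `gamma = 0` this counts only outright misclassifications; larger margins also count correct but low-confidence predictions as errors.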

Then for any bounded support distribution $\mathcal{D}$ (that is, each $(x, y) \sim \mathcal{D}$, $\|x\| \le B < \infty$), we
have the following convergence result that relates the estimation error rate of the weighting metrics
with that of the fat-shattering dimension of the underlying base hypothesis class.

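Lemma 3 Let $\mathcal{H}$ be a $\lambda$-Lipschitz base hypothesis class. Pick any $0 < \gamma < 1/2$, and let
$m \ge \mathrm{Fat}_{\gamma/16}(\mathcal{H}) \ge 1$. Then with probability at least $1 - \delta$ over an i.i.d. draw of sample $S_m$ (of size $m$)
from a bounded unknown distribution $\mathcal{D}$ (with bound $B$) on $X \times \{0,1\}$,

$$\sup_{M \in \mathcal{M}} \big[\mathrm{err}_{\mathrm{hypoth}}(M, \mathcal{D}) - \mathrm{err}^{\gamma}_{\mathrm{hypoth}}(M, S_m)\big] \le O\left(\sqrt{\frac{1}{m}\ln\frac{1}{\delta} + \frac{D^2}{m}\ln\frac{D}{\epsilon_0} + \frac{\mathrm{Fat}_{\gamma/16}(\mathcal{H})}{m}\ln\frac{m}{\gamma}}\right),$$

where $\epsilon_0 := \min\{\frac{\gamma}{2}, \frac{1}{2\lambda B}\}$ and $\mathrm{Fat}_{\gamma/16}(\mathcal{H})$ is the fat-shattering dimension of the base hypothesis
class $\mathcal{H}$ at margin $\gamma/16$.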

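Using a similar line of argument as before, we can bound the key quantity of interest (Eq. 2) and
conclude for any $0 < \gamma < 1/2$ and any $m \ge 1$, with probability $\ge 1 - \delta$,

$$\mathrm{err}_{\mathrm{hypoth}}(M^*_m, \mathcal{D}) - \mathrm{err}^{\gamma}_{\mathrm{hypoth}}(M^*, \mathcal{D}) = O\left(\sqrt{\frac{D^2 \ln(D/\epsilon_0)}{m} + \frac{\mathrm{Fat}_{\gamma/16}(\mathcal{H}) \ln(m/\delta\gamma)}{m}}\right).$$

Here $\epsilon_0 = \min\{\frac{\gamma}{2}, \frac{1}{2\lambda B}\}$ for a $\lambda$-Lipschitz hypothesis class $\mathcal{H}$. Thus to achieve a specific estimation
error rate $\epsilon$, the number of samples $m = \Omega\left(\frac{D^2 \ln(\lambda D/\gamma) + \mathrm{Fat}_{\gamma/16}(\mathcal{H}) \ln(1/\delta\gamma)}{\epsilon^2}\right)$ suffices to say, with
confidence at least $1 - \delta$, that the empirical risk minimizing metric $M^*_m$ will have estimation error at
most $\epsilon$.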

It is interesting to note that the task of finding an optimal metric only additively increases the
sample complexity over the complexity of finding the optimal hypothesis from the underlying
hypothesis class.

In contrast to the sample complexity of the distance-based framework (cf. Lemma 1), here we get
a quadratic dependence on the representation dimension. The following lemma shows that a strong
dependence on the representation dimension is necessary in the absence of any specific assumptions on
the underlying data distribution and the base hypothesis class.

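Lemma 4 Pick any $0 < \gamma < 1/8$. Let $\mathcal{H}$ be a base hypothesis class of $\lambda$-Lipschitz functions
mapping from $X = \mathbb{R}^D$ into the interval $[1/2 - 4\gamma, 1/2 + 4\gamma]$ that is closed under addition of
constants. That is,

$$h \in \mathcal{H} \implies h' \in \mathcal{H}, \text{ where } h' : x \mapsto h(x) + c \text{ for all } c.$$

Then for any classification algorithm $A$, and for any $B > 1$, there exists $\lambda > 0$, for all
$0 < \epsilon, \delta < 1/64$, there exists a bounded support distribution $\mathcal{D}$ (with bound $B$) such that if
$m \ln^2 m < \Omega\left(\frac{D^2 + d}{\epsilon^2 \ln^2(1/\gamma)}\right)$,

$$P_{S_m \sim \mathcal{D}}\big[\mathrm{err}_{\mathrm{hypoth}}(h^*, \mathcal{D}) > \mathrm{err}^{\gamma}_{\mathrm{hypoth}}(A(S_m), \mathcal{D}) + \epsilon\big] > \delta,$$

where $d := \mathrm{Fat}_{768\gamma}(\mathcal{H})$ is the fat-shattering dimension of $\mathcal{H}$ at margin $768\gamma$.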


4 Data with Uninformative and Weakly Informative Features 

Different measurements have varying degrees of "information content" for the particular supervised
classification task of interest. Any algorithm or analysis that studies the design of effective
comparisons between observations must account for this variability.

To get a solid footing for our study, we introduce the concept of metric learning complexity of 
a given dataset. Our key observation is that a metric that yields good generalization performance 
should emphasize relevant features while suppressing the contribution of spurious features. Thus, a 
good metric reflects the quality of individual feature measurements of data and their relative value for 
the learning task. We can leverage this and define the metric learning complexity of a given dataset 
as the intrinsic complexity d of the weighting metric that yields the best generalization performance 
for that dataset (if multiple metrics yield best performance, we select the one with minimum d). 
A natural way to characterize the intrinsic complexity of a weighting metric M is via the norm of 
the matrix representation of M. Using metric learning complexity as our gauge for the richness of 
the feature set in a given dataset, we can refine our analysis in both our canonical metric learning 
frameworks. 

4.1 Distance-Based Refinement 

We start with the following refinement of the distance-based metric learning sample complexity for
a class of Frobenius norm-bounded weighting metrics.

Lemma 5 Let $\mathcal{M}$ be any class of weighting metrics on the feature space $X = \mathbb{R}^D$. Fix any sample
size $m$, and let $S_m$ be an i.i.d. paired sample of size $m$ from an unknown bounded distribution $\mathcal{D}$ on
$X \times \{0,1\}$ (with bound $B$). For any distance-based loss function $\phi^{\lambda}$ that is $\lambda$-Lipschitz in the first
argument, with probability at least $1 - \delta$ over the draw of $S_m$,

$$\sup_{M \in \mathcal{M}} \big[\mathrm{err}^{\lambda}_{\mathrm{dist}}(M, \mathcal{D}) - \mathrm{err}^{\lambda}_{\mathrm{dist}}(M, S_m)\big] \le O\left(\lambda B^2 \sqrt{\frac{d \ln(1/\delta)}{m}}\right),$$

where $d$ is a uniform upper bound on the Frobenius norm of the quadratic form of weighting metrics
in $\mathcal{M}$, that is, $\sup_{M \in \mathcal{M}} \|M^{\top} M\|_F^2 \le d$.


Observe that if our dataset has a low metric learning complexity (say, $d \ll D$), then considering
an appropriate class of norm-bounded weighting metrics can help sharpen the sample complexity
result, yielding a dataset-dependent bound. We discuss how to automatically adapt to the right
complexity class in Section 4.3 below.


4.2 Classifier-Based Refinement 


Effective data-dependent analysis of classifier-based metric learning requires accounting for
potentially complex interactions between an arbitrary base hypothesis class and the distortion induced
by a weighting metric to the unknown underlying data distribution. To make the analysis tractable
while still keeping our base hypothesis class $\mathcal{H}$ general, we shall assume that $\mathcal{H}$ is a class of
two-layer feed-forward neural networks. Recall that for any smooth target function $f^*$, a two-layer
feed-forward neural network (with appropriate number of hidden units and connection weights) can
approximate $f^*$ arbitrarily well (Hornik et al., 1989), so this class is flexible enough to incorporate
most reasonable target hypotheses.

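More formally, define the base hypothesis class of two-layer feed-forward neural networks with
$K$ hidden units as

$$\mathcal{H}^{\text{2-net}}_{\sigma^{\gamma}} := \left\{ x \mapsto \sum_{i=1}^{K} w_i\, \sigma^{\gamma}(v_i \cdot x) \;\middle|\; \|w\|_1 \le 1, \|v_i\|_1 \le 1 \right\},$$

where $\sigma^{\gamma} : \mathbb{R} \to [-1, 1]$ is a smooth, strictly monotonic, $\gamma$-Lipschitz activation function with
$\sigma^{\gamma}(0) = 0$. Then for the generalization error of a weighting metric $M$ defined with respect to any
classifier-based $\lambda$-Lipschitz loss function $\phi^{\lambda}$,

$$\mathrm{err}^{\lambda}_{\mathrm{hypoth}}(M, \mathcal{D}) := \inf_{h \in \mathcal{H}^{\text{2-net}}_{\sigma^{\gamma}}} \mathbb{E}_{(x,y) \sim \mathcal{D}} \big[\phi^{\lambda}(h(Mx), y)\big],$$

we have the following.^3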

Lemma 6 Let $\mathcal{M}$ be any class of weighting metrics on the feature space $X = \mathbb{R}^D$. For any $\gamma > 0$,
let $\mathcal{H}^{\text{2-net}}_{\sigma^{\gamma}}$ be a two-layer feed-forward neural network base hypothesis class (as defined above) and
$\phi^{\lambda}$ be a classifier-based loss function that is $\lambda$-Lipschitz in its first argument. Fix any sample size $m$,
and let $S_m$ be an i.i.d. sample of size $m$ from an unknown bounded distribution $\mathcal{D}$ on $X \times \{0,1\}$
(with bound $B$). Then with probability at least $1 - \delta$,

$$\sup_{M \in \mathcal{M}} \big[\mathrm{err}^{\lambda}_{\mathrm{hypoth}}(M, \mathcal{D}) - \mathrm{err}^{\lambda}_{\mathrm{hypoth}}(M, S_m)\big] \le O\left(B \lambda \gamma \sqrt{\frac{d \ln(D/\delta)}{m}}\right),$$

where $d$ is a uniform upper bound on the Frobenius norm of the quadratic form of weighting metrics
in $\mathcal{M}$, that is, $\sup_{M \in \mathcal{M}} \|M^{\top} M\|_F^2 \le d$.

^3 Since we know the functional form of the base hypothesis class $\mathcal{H}$ (i.e., a two-layer feed-forward neural net), we can
provide a more precise bound than leaving it as $\mathrm{Fat}(\mathcal{H})$.


4.3 Automatically Adapting to Intrinsic Complexity 

Note that while Lemmas 5 and 6 provide a sample complexity bound that is tuned to the metric
learning complexity of a given dataset, these results are not useful directly since one cannot select
the correct norm-bounded class $\mathcal{M}$ a priori (as the underlying distribution $\mathcal{D}$ is unknown).

Fortunately, by considering an appropriate sequence of norm-bounded classes of weighting
metrics, we can provide a uniform bound that automatically adapts to the intrinsic complexity of the
unknown underlying data distribution $\mathcal{D}$. In particular, we have the following.

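Corollary 7 Fix any $m$, and let $S_m$ be an i.i.d. sample of size $m$ from an unknown bounded
distribution $\mathcal{D}$ (with bound $B$). Define $\mathcal{M}^d := \{M \mid \|M^{\top} M\|_F^2 \le d\}$, and consider the nested sequence of
weighting metric classes $\mathcal{M}^1 \subset \mathcal{M}^2 \subset \cdots$. Let $\mu_d$ be any non-negative measure across the sequence
$\mathcal{M}^d$ such that $\sum_d \mu_d = 1$ (for $d = 1, 2, \ldots$). Then for any $\lambda > 0$, with probability at least $1 - \delta$,
for all $d = 1, 2, \ldots$, and all $M^d \in \mathcal{M}^d$,

$$\big[\mathrm{err}^{\lambda}(M^d, \mathcal{D}) - \mathrm{err}^{\lambda}(M^d, S_m)\big] \le O\left(C \cdot B\lambda \sqrt{\frac{d \ln(1/(\delta \mu_d))}{m}}\right), \qquad (3)$$

where $C := B$ for distance-based error, or $C := \gamma \sqrt{\ln D}$ for classifier-based error (with base
hypothesis class $\mathcal{H}^{\text{2-net}}_{\sigma^{\gamma}}$).

In particular, for a data distribution $\mathcal{D}$ that has metric learning complexity at most $d \in \mathbb{N}$, if
there are $m \ge \tilde{\Omega}\big(d\,(CB\lambda)^2/\epsilon^2\big)$ samples, then with probability at least $1 - \delta$

$$\big[\mathrm{err}^{\lambda}(M^{\mathrm{reg}}_m, \mathcal{D}) - \mathrm{err}^{\lambda}(M^*, \mathcal{D})\big] \le O(\epsilon),$$

for $M^{\mathrm{reg}}_m := \operatorname*{argmin}_{M \in \mathcal{M}} \big[\mathrm{err}^{\lambda}(M, S_m) + \Lambda_M d_M\big]$,
where $\Lambda_M := CB\lambda \sqrt{\ln\big(1/(\delta \mu_{d_M})\big)/m}$ and $d_M := \lceil \|M^{\top} M\|_F^2 \rceil$.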

Observe that the measure $(\mu_d)$ above encodes our prior belief on the complexity class $\mathcal{M}^d$ from
which a target metric is selected by a metric learning algorithm given the training sample $S_m$. In
the absence of any prior beliefs, $\mu_d$ can be simply set to $1/D$ (for $d = 1, \ldots, D$) for unit spectral-norm
weighting metrics.

Thus, for an unknown underlying data distribution $\mathcal{D}$ with metric learning complexity $d$, with
number of samples just proportional to $d$, we can find a good weighting metric.

This result also highlights that the generalization error of any weighting metric returned by an
algorithm is proportional to the (smallest) norm-bounded class to which it belongs (cf. Eq. 3). If
two metrics $M_1$ and $M_2$ have similar empirical errors on a given sample, but have different intrinsic
complexities, then the expected risk of the two metrics can be considerably different. We expect the
metric with lower intrinsic complexity to yield better generalization error. This partly explains the
observed empirical success of various types of norm-regularized optimization criteria for finding the
optimal weighting metric (Lim et al., 2013; Law et al., 2014).

Figure 1: Nearest-neighbor classification performance of LMNN and ITML metric learning algorithms
without regularization (dashed red lines) and with regularization (solid blue lines) on benchmark UCI datasets. The
horizontal dotted line is the classification error of random label assignment drawn according to the class
proportions, and the solid gray line shows the classification error of k-NN with respect to the identity metric (no
metric learning) for baseline reference.

Using this as a guiding principle, we can design an improved optimization criterion for
metric learning problems that jointly minimizes the sample error and a Frobenius norm regularization
penalty. In particular,

$$\min_{M \in \mathcal{M}} \; \mathrm{err}(M, S_m) + \Lambda \|M^{\top} M\|_F^2 \qquad (4)$$

for any error criterion 'err' used in a downstream prediction task of interest and a regularization
hyper-parameter $\Lambda$ proportional to $1/\sqrt{m}$. We explore the practical efficacy of this augmented
optimization on some representative applications below.
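The regularized criterion can be sketched as an objective function to be handed to any optimizer. The sketch assumes the penalty is the squared Frobenius norm of the quadratic form $M^{\top} M$ and that the regularization weight decays as $1/\sqrt{m}$; the callable `empirical_error`, the helper `lam_schedule`, and its constant `c` are hypothetical and stand in for whatever error criterion and tuning the practitioner uses.

```python
import numpy as np

def lam_schedule(m, c=1.0):
    """Regularization weight decaying as 1/sqrt(m); c is a tuning constant (assumption)."""
    return c / np.sqrt(m)

def regularized_objective(M, empirical_error, Lam):
    """Empirical error of M plus Lam times the squared Frobenius norm of M^T M.

    empirical_error is any callable mapping a weighting matrix to its
    sample error (for example a pairwise or triplet loss average).
    """
    M = np.asarray(M, dtype=float)
    Q = M.T @ M                       # quadratic form of the metric
    penalty = float(np.sum(Q * Q))    # squared Frobenius norm
    return float(empirical_error(M)) + Lam * penalty
```

Two metrics with equal sample error are then ranked by the norm of their quadratic forms, which favors the one of lower intrinsic complexity.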


5 Empirical Evaluation 


Our analysis shows that the generalization error of metric learning can scale with the representation 
dimension, and regularization can help mitigate this by adapting to the intrinsic metric learning 
complexity of the given dataset. We want to explore to what degree these effects manifest in practice. 

We select two popular metric learning algorithms, LMNN by Weinberger & Saul (2009) and
ITML by Davis et al. (2007), that are designed to find metrics that improve nearest-neighbor
classification quality. These algorithms have varying degrees of regularization built into their optimization
criteria: LMNN implicitly regularizes the metric via its "large margin" criterion, while ITML allows
for explicit regularization by letting the practitioners specify a "prior" weighting metric. We
modified the LMNN optimization criteria as per Eq. (4) to also allow for an explicit norm-regularization
controlled by the trade-off parameter $\Lambda$.

We can evaluate how the unregularized criteria (i.e., unmodified LMNN, or ITML with the prior
set to the identity matrix) compare to the regularized criteria (i.e., modified LMNN with best $\Lambda$, or
ITML with the prior set to a low-rank matrix).


Datasets. We use the UCI benchmark datasets for our experiments: Iris (4 dim., 150 samples),
Wine (13 dim., 178 samples) and Ionosphere (34 dim., 351 samples) datasets (Bache & Lichman,
2013). Each dataset has a fixed (unknown) intrinsic dimension; we can vary the representation
dimension by augmenting each dataset with synthetic correlated noise of varying dimensions,
simulating regimes where datasets contain large numbers of uninformative features.

Each UCI dataset is augmented with synthetic $D$-dimensional correlated noise as follows. We
first sample a covariance matrix $\Sigma_D$ from a unit-scale Wishart distribution (that is, let $A$ be a $D \times D$
Gaussian random matrix with entries $A_{ij} \sim N(0, 1)$ drawn i.i.d., and set $\Sigma_D := A^{\top} A$). Then each
sample $x_i$ from the dataset is augmented independently with a noise vector $x_{\sigma} \sim N(0, \Sigma_D)$ appended to it.
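The augmentation procedure can be sketched in NumPy as follows. The function name and the use of `numpy.random.Generator` are assumptions; the sketch draws the noise as $zA$ with $z \sim N(0, I)$, which has covariance $A^{\top} A$, rather than sampling from the covariance matrix explicitly.

```python
import numpy as np

def augment_with_correlated_noise(X, noise_dim, rng):
    """Append noise_dim correlated Gaussian noise features to every row of X.

    A is a noise_dim x noise_dim matrix with i.i.d. N(0, 1) entries, so each
    noise row z @ A (z standard normal) has covariance A^T A, a draw from a
    unit-scale Wishart distribution.
    """
    X = np.asarray(X, dtype=float)
    if noise_dim == 0:
        return X.copy()
    A = rng.standard_normal((noise_dim, noise_dim))
    Z = rng.standard_normal((X.shape[0], noise_dim))
    noise = Z @ A                     # rows are independent N(0, A^T A) vectors
    return np.hstack([X, noise])
```

For example, `augment_with_correlated_noise(X, 100, np.random.default_rng(0))` leaves the original features in the leading columns and adds 100 uninformative ones.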


Experimental setup. We varied the ambient noise dimension $D$ between 0 and 500 dimensions and
added it to the UCI datasets, creating the noise-augmented datasets. Each noise-augmented dataset
was randomly split between 70% training, 10% validation, and 20% test samples.
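A random 70/10/20 split of this kind can be sketched as below. How fractional sizes are rounded is not stated in the text, so rounding to the nearest integer and giving the remainder to the test set is an assumption, as is the function name.

```python
import numpy as np

def split_indices(n, rng, train=0.7, val=0.1):
    """Randomly partition range(n) into train, validation and test index arrays.

    The test set receives whatever remains after the train and validation
    portions (20% with the default fractions).
    """
    perm = rng.permutation(n)
    n_train = int(round(train * n))
    n_val = int(round(val * n))
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]
```

Drawing a fresh permutation for each run gives independent splits that can then be averaged over.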

We used the default settings for each algorithm. For regularized LMNN, we picked the best
performing trade-off parameter $\Lambda$ from $\{0, 0.1, 0.2, \ldots, 1\}$ on the validation set. For regularized
ITML, we seeded with the rank-one discriminating metric, i.e., we set the prior as the matrix with
all zeros, except the diagonal entry corresponding to the most discriminating coordinate set to one.

All the reported results were averaged over 20 runs.

Results. Figure 1 shows the nearest-neighbor performance (with $k = 3$) of LMNN and ITML on
noise-augmented UCI datasets. Notice that the unregularized versions of both algorithms (dashed
red lines) scale poorly when noisy features are introduced. As the number of uninformative features
grows, the performance of both algorithms quickly degrades to that of classification performance in
the original unweighted space with no metric learning (solid gray line), showing poor adaptability
to the signal in the data.

Interestingly, neither of the unregularized algorithms performs consistently better than the other
on datasets with high noise: ITML yields better results on Wine, whereas LMNN seems better for
Ionosphere, and both algorithms yield similar performance on Iris.

The regularized versions of both algorithms (solid blue lines) significantly improve the
classification performance. Remarkably, regularized ITML shows almost no degradation in classification
performance, even in very high noise regimes, demonstrating a strong robustness to noise.

These results underscore the value of regularization in metric learning, showing that
regularization encourages adaptability to the intrinsic complexity and improved robustness to noise.


6 Discussion and Conclusion 


Previous theoretical work on metric learning has focused almost exclusively on analyzing the
generalization error of variants of the optimization criteria for the distance-based metric learning
framework.

Jin et al. (2009), for instance, analyzed the generalization ability of regularized, convex-loss
optimization criteria for pairwise distances via an algorithmic stability analysis. They derive an
interesting sample complexity result that is sublinear in $\sqrt{D}$ for datasets of representation
dimension $D$. They discuss that the sample complexity can potentially be independent of $D$, but do not
characterize specific instances or classes of problems where this may be possible.

Likewise, recent work by Bellet & Habrard (2012) uses algorithmic robustness to analyze the
generalization ability for pairwise- and triplet-based distance metric learning. Their analysis relies
on the existence of a partition of the input space, such that in each cell of the partition, the training
loss and test loss do not deviate much (robustness criteria). Note that their sample complexity
bound scales with the partition size, which in general can be exponential in the representation
dimension.


Perhaps the works most similar to our approach are the sample complexity analyses by Bian &
Tao (2011) and Cao et al. (2013). Bian & Tao (2011) analyze the consistency of the ERM
criterion for metric learning. They show a $O(m^{-1/2})$ rate of convergence for the ERM with $m$ samples
to the expected risk for thresholds on bounded convex losses for distance-based metric learning.
Our upper-bound in Lemma 1 generalizes this result by considering arbitrary (possibly non-convex)
distance-based Lipschitz losses and explicitly shows the dependence on the representation
dimension $D$. Cao et al. (2013) provide an alternate analysis based on norm regularization of the weighting
metric for distance-based metric learning. Their result parallels our norm-regularized criterion in
Lemma 5. While they focus on analyzing a specific optimization criterion (thresholds on the hinge
loss with norm-regularization), our result holds for general Lipschitz losses.

It is worth emphasizing that none of these related works discuss the importance of or leverage
the intrinsic structure in data for the metric learning problem. Our results formalize
an intuitive notion of a dataset's intrinsic complexity for metric learning and show sample complexity
rates that are finely tuned to this metric learning complexity.

The classifier-based framework we discuss has parallels with the kernel learning literature. The
typical focus in kernel learning is to analyze the generalization ability of the hypothesis class of
linear separators in general Hilbert spaces (Ying & Campbell, 2009; Cortes et al., 2010). Our work
provides a complementary analysis for learning explicit linear transformations of the given
representation space for arbitrary hypotheses classes.

Our theoretical analysis partly justifies the empirical success of norm-based regularization as
well. Our empirical results show that such regularization not only helps in designing new metric
learning algorithms (Lim et al., 2013; Law et al., 2014), but can even benefit existing metric learning
algorithms in high-noise regimes.


A Appendix: Various Proofs

A.1 Proof of Lemma 1

Let $\mathcal{P}$ be the probability measure induced by the random variable $(X, Y)$, where $X := (x, x')$,
$Y := \mathbf{1}[y = y']$, s.t. $((x, y), (x', y')) \sim (\mathcal{D} \times \mathcal{D})$.

Define function class

$$\mathcal{F} := \Big\{ f_M : X \mapsto \|M(x - x')\|^2 \;\Big|\; M \in \mathcal{M},\; X = (x, x') \in (\mathcal{X} \times \mathcal{X}) \Big\},$$

and consider any loss function $\phi^\lambda(\rho, Y)$ that is $\lambda$-Lipschitz in the first argument. Then, we are
interested in bounding the quantity

$$\sup_{f_M \in \mathcal{F}} \; \mathbb{E}_{(X,Y) \sim \mathcal{P}}\big[\phi^\lambda(f_M(X), Y)\big] - \frac{1}{m} \sum_{i=1}^{m} \phi^\lambda(f_M(X_i), Y_i),$$

where $X_i := (x_{1,i}, x_{2,i})$, $Y_i := \mathbf{1}[y_{1,i} = y_{2,i}]$ from the paired sample
$S_m = \{((x_{1,i}, y_{1,i}), (x_{2,i}, y_{2,i}))\}_{i=1}^{m}$.
Define $\bar{x}_i := x_{1,i} - x_{2,i}$ for each $X_i = (x_{1,i}, x_{2,i})$. Then, the Rademacher complexity of
our function class $\mathcal{F}$ (with respect to the distribution $\mathcal{P}$) is bounded, since (let $\sigma_1, \ldots, \sigma_m$ denote
independent uniform $\{\pm 1\}$-valued random variables)


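$$
\begin{aligned}
\mathcal{R}_m(\mathcal{F}, \mathcal{P})
&:= \mathbb{E}_{X_i, \sigma_i,\, i \in [m]} \Big[ \sup_{f_M \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i f_M(X_i) \Big] \\
&= \frac{1}{m} \mathbb{E}_{X_i, \sigma_i,\, i \in [m]} \Big[ \sup_{M \in \mathcal{M}} \sum_{i=1}^{m} \sigma_i \bar{x}_i^T M^T M \bar{x}_i \Big] \\
&= \frac{1}{m} \mathbb{E}_{X_i, \sigma_i,\, i \in [m]} \Big[ \sup_{M \in \mathcal{M} \text{ s.t. } [a^{jk}]_{jk} := M^T M} \sum_{j,k} a^{jk} \sum_{i=1}^{m} \sigma_i \bar{x}_i^j \bar{x}_i^k \Big] \\
&\le \frac{1}{m} \mathbb{E}_{X_i, \sigma_i,\, i \in [m]} \Big[ \sup_{M \in \mathcal{M}} \|M^T M\|_F \Big( \sum_{j,k} \Big( \sum_{i=1}^{m} \sigma_i \bar{x}_i^j \bar{x}_i^k \Big)^2 \Big)^{1/2} \Big] \\
&\le \frac{\sqrt{D}}{m} \mathbb{E}_{X_i,\, i \in [m]} \Big[ \mathbb{E}_{\sigma_i,\, i \in [m]} \Big( \sum_{j,k} \Big( \sum_{i=1}^{m} \sigma_i \bar{x}_i^j \bar{x}_i^k \Big)^2 \Big)^{1/2} \Big] \\
&\le \frac{\sqrt{D}}{m} \mathbb{E}_{X_i,\, i \in [m]} \Big[ \Big( \sum_{j,k} \sum_{i=1}^{m} \big( \bar{x}_i^j \bar{x}_i^k \big)^2 \Big)^{1/2} \Big] \\
&= \frac{\sqrt{D}}{m} \mathbb{E}_{X_i,\, i \in [m]} \Big[ \Big( \sum_{i=1}^{m} \|\bar{x}_i\|^4 \Big)^{1/2} \Big] \\
&\le \sqrt{\frac{D}{m}} \Big( \mathbb{E}_{(x, x') \sim (\mathcal{D}|_X \times \mathcal{D}|_X)} \|x - x'\|^4 \Big)^{1/2} \\
&\le 4B^2 \sqrt{\frac{D}{m}},
\end{aligned}
$$

where $\bar{x}_i := x_{1,i} - x_{2,i}$ with superscripts denoting coordinates, and the second inequality is by
noting that $\sup_{M \in \mathcal{M}} \|M^T M\|_F \le \sqrt{D}$ for the class of weighting
metrics $\mathcal{M} := \{M \mid M \in \mathbb{R}^{D \times D}, \sigma_{\max}(M) = 1\}$. (See the definition of Rademacher
complexity in the statement of Lemma 8.)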

Recall that $\mathcal{D}$ has bounded support (with bound $B$). Thus, by noting that $\phi^\lambda$ is an $8B^2$-bounded
function that is $\lambda$-Lipschitz in the first argument, we can apply Lemma 8 and get the desired uniform
deviation bound. ∎
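
The $4B^2\sqrt{D/m}$ bound can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not part of the original proof: it uses the observation that $M^T M$ ranges over PSD matrices whose largest eigenvalue is 1, so the inner supremum has a closed form in the eigenvalues of $\sum_i \sigma_i \bar{x}_i \bar{x}_i^T$. The sampling scheme, dimensions, and function names are hypothetical choices.

```python
import numpy as np

def sup_over_unit_spectral_metrics(S):
    """sup over M with sigma_max(M) = 1 of <M^T M, S>, for symmetric S.

    M^T M is PSD with top eigenvalue exactly 1, so the supremum equals the
    top eigenvalue of S plus every other positive eigenvalue of S.
    """
    eigvals = np.linalg.eigvalsh(S)                      # ascending order
    return eigvals[-1] + np.clip(eigvals[:-1], 0.0, None).sum()

def empirical_rademacher(diffs, n_draws, rng):
    """Monte Carlo estimate of E_sigma sup_M (1/m) sum_i sigma_i ||M xbar_i||^2."""
    m = diffs.shape[0]
    total = 0.0
    for _ in range(n_draws):
        signs = rng.choice([-1.0, 1.0], size=m)
        S = (diffs * signs[:, None]).T @ diffs           # sum_i sigma_i xbar_i xbar_i^T
        total += sup_over_unit_spectral_metrics(S) / m
    return total / n_draws

rng = np.random.default_rng(0)
D, m, B = 5, 200, 1.0

def sample_ball(n):
    """Points with norm at most B: random directions scaled by random radii."""
    v = rng.standard_normal((n, D))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return B * v * rng.uniform(0.0, 1.0, size=(n, 1))

diffs = sample_ball(m) - sample_ball(m)                  # xbar_i = x_{1,i} - x_{2,i}
estimate = empirical_rademacher(diffs, n_draws=200, rng=rng)
bound = 4 * B**2 * np.sqrt(D / m)
```

On data like this the estimate should sit well below the bound, since the bound is a worst case over all distributions supported in the radius-$B$ ball.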


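Lemma 8 [Rademacher complexity of bounded Lipschitz loss functions, Bartlett & Mendelson
(2002)] Let $\mathcal{D}$ be a fixed unknown distribution over $X \times \{-1, 1\}$, and let $S_m$ be an i.i.d. sample of
size $m$ from $\mathcal{D}$. Given a hypothesis class $\mathcal{H} \subset \mathbb{R}^X$ and a loss function $\ell : \mathbb{R} \times \{-1, 1\} \to \mathbb{R}$, such
that $\ell$ is $c$-bounded, and is $\lambda$-Lipschitz in the first argument, that is,
$\sup_{(y', y) \in \mathbb{R} \times \{-1, 1\}} |\ell(y', y)| \le c$, and $|\ell(y', y) - \ell(y'', y)| \le \lambda |y' - y''|$, we have the following:

for any $0 < \delta < 1$, with probability at least $1 - \delta$, every $h \in \mathcal{H}$ satisfies

$$\mathrm{err}(\ell \circ h, \mathcal{D}) \le \mathrm{err}(\ell \circ h, S_m) + 2\lambda \mathcal{R}_m(\mathcal{H}, \mathcal{D}) + c \sqrt{\frac{2 \ln(1/\delta)}{m}},$$

where

• $\mathrm{err}(\ell \circ h, \mathcal{D}) := \mathbb{E}_{(x, y) \sim \mathcal{D}}[\ell(h(x), y)]$,

• $\mathrm{err}(\ell \circ h, S_m) := \frac{1}{m} \sum_{(x_i, y_i) \in S_m} \ell(h(x_i), y_i)$,

• $\mathcal{R}_m(\mathcal{H}, \mathcal{D})$ is the Rademacher complexity of the function class $\mathcal{H}$ with respect to the
distribution $\mathcal{D}$ given $m$ i.i.d. samples, and is defined as:

$$\mathcal{R}_m(\mathcal{H}, \mathcal{D}) := \mathbb{E}_{x_i \sim \mathcal{D}|_X,\, \sigma_i,\, i \in [m]} \Big[ \sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i) \Big],$$

where $\sigma_i$ are independent uniform $\{\pm 1\}$-valued random variables.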


A.2 Proof of Lemma 2

We shall exhibit a finite class of bounded support distributions $\mathfrak{D}$, such that if $\mathcal{D}$ is chosen uniformly
at random from $\mathfrak{D}$, the expectation (over the random choice of $\mathcal{D}$) of the probability of failure (that is,
generalization error of the metric returned by $\mathcal{A}$ compared to that of the optimal metric exceeds the
specified tolerance level $\epsilon$) is at least $\delta$. This implies that for some distribution in $\mathfrak{D}$ the probability
of failure is at least $\delta$ as well.

Let $\Delta_D := \{x_0, \ldots, x_D\}$ be a set of $D + 1$ points that form the vertices of a regular
unit-simplex from the underlying space $X = \mathbb{R}^D$ as per Definition 1 (see below). For a fixed parameter
$0 < \alpha < 1$ (exact value determined later), define $\mathfrak{D}$ as the class of all distributions $\mathcal{D}$ on $X \times \{0, 1\}$
such that:

• $\mathcal{D}$ assigns zero probability to all sets not intersecting $\Delta_D \times \{0, 1\}$.

• for each $i = 0, \ldots, D$, either

  - $P[(x_i, 1)] = (1 + \sqrt{\alpha})/2$ and $P[(x_i, 0)] = (1 - \sqrt{\alpha})/2$, or

  - $P[(x_i, 1)] = (1 - \sqrt{\alpha})/2$ and $P[(x_i, 0)] = (1 + \sqrt{\alpha})/2$.

For concreteness, we shall use a specific instantiation of $\phi^\lambda$ in $\mathrm{err}^\lambda_{\mathrm{dist}}$ with $U = 0$, $L = 4/D$
and $\lambda = D/4$.

Proof overview. We first show, by the construction of the distributions under consideration in $\mathfrak{D}$, the
sample error and the generalization error minimizing metrics over any $\mathcal{D} \in \mathfrak{D}$ belong to a restricted
class of weighting matrices (Eq. (5)). We then make a second simplification by noting that finding
these (sample- and generalization-) error minimizing metrics (in the restricted class) is equivalent to
solving a binary classification problem (Eq. (6)). This reduction to binary classification enables us to
use VC-style lower bounding techniques to give a lower bound on the sample complexity. We now
fill in the details.

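Consider a subset of weighting metrics $\mathcal{M}_{0\text{-}1}$ that map points in $\Delta_D$ to exactly one of two
possible points that are (squared) distance at least $4/D$ apart, that is,

$$\mathcal{M}_{0\text{-}1} := \{ M \mid M \in \mathcal{M},\; \exists z_0, z_1 \in \mathbb{R}^D,\; \forall x \in \Delta_D,\; Mx \in \{z_0, z_1\} \text{ and } \|z_0 - z_1\|^2 \ge 4/D \}.$$

Now pick any $\mathcal{D} \in \mathfrak{D}$, let $S_m$ be an i.i.d. paired sample from $\mathcal{D}$. Observe that both the
sample-based and the distribution-based error minimizing weighting metric from $\mathcal{M}$ on $\mathcal{D}$ also belong to
$\mathcal{M}_{0\text{-}1}$. That is (c.f. Lemma 10),

$$
\begin{aligned}
\operatorname{argmin}_{M \in \mathcal{M}} \mathrm{err}^\lambda_{\mathrm{dist}}(M, \mathcal{D}) &= \operatorname{argmin}_{M \in \mathcal{M}_{0\text{-}1}} \mathrm{err}^\lambda_{\mathrm{dist}}(M, \mathcal{D}) \\
\operatorname{argmin}_{M \in \mathcal{M}} \mathrm{err}^\lambda_{\mathrm{dist}}(M, S_m) &= \operatorname{argmin}_{M \in \mathcal{M}_{0\text{-}1}} \mathrm{err}^\lambda_{\mathrm{dist}}(M, S_m).
\end{aligned}
\qquad (5)
$$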

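A reduction to binary classification on product space. For each $M \in \mathcal{M}_{0\text{-}1}$, we associate a
classifier $f_M : (\Delta_D \times \Delta_D) \to \{0, 1\}$ defined as $(x_i, x_j) \mapsto \mathbf{1}[Mx_i = Mx_j]$. Now, consider the
probability measure $\mathcal{P}$ induced by the random variable $(X, Y)$, where $X := (x, x')$, $Y := \mathbf{1}[y = y']$,
s.t. $((x, y), (x', y')) \sim (\mathcal{D}|_{(\Delta_D \times \{0,1\})} \times \mathcal{D}|_{(\Delta_D \times \{0,1\})})$. It is easy to check that for all $M \in \mathcal{M}_{0\text{-}1}$

$$
\begin{aligned}
\mathrm{err}^\lambda_{\mathrm{dist}}(M, \mathcal{D}) &= \mathbb{E}_{(X, Y) \sim \mathcal{P}}\big[\mathbf{1}[f_M(X) \ne Y]\big] \\
\mathrm{err}^\lambda_{\mathrm{dist}}(M, S_m) &= \frac{1}{m} \sum_{((x, y), (x', y')) \in S_m} \mathbf{1}\big[f_M((x, x')) \ne \mathbf{1}[y = y']\big].
\end{aligned}
\qquad (6)
$$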

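Define

$$
\begin{aligned}
\eta(X) &:= P_{Y \sim \mathcal{P}|_X}[Y = 1 \mid X] \\
&= P_{(y, y')}[y = y' \mid x, x'] \\
&= \begin{cases} \frac{1}{2} + \frac{\alpha}{2} & \text{if } P(y \mid x) = P(y' \mid x') \\ \frac{1}{2} - \frac{\alpha}{2} & \text{if } P(y \mid x) \ne P(y' \mid x') \end{cases}.
\end{aligned}
$$

Observe that $\eta(X)$ is the Bayes error rate at $X$ for distribution $\mathcal{P}$. Since, by construction of $\mathcal{M}_{0\text{-}1}$,
the class $\{f_M\}_{M \in \mathcal{M}_{0\text{-}1}}$ contains a classifier that achieves the Bayes error rate, the optimal classifier
$f^* := \operatorname{argmin}_{f_M} \mathbb{E}_{(X, Y) \sim \mathcal{P}} \mathbf{1}[f_M(X) \ne Y]$ necessarily has $f^*(X) = \mathbf{1}[\eta(X) > \frac{1}{2}]$ (for all $X$).
Then, for any $f_M$,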

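$$
\begin{aligned}
&\mathbb{E}_{(X, Y) \sim \mathcal{P}}\big[\mathbf{1}[f_M(X) \ne Y]\big] - \mathbb{E}_{(X, Y) \sim \mathcal{P}}\big[\mathbf{1}[f^*(X) \ne Y]\big] \\
&\quad= \mathbb{E}_{X \sim \mathcal{P}|_X}\big[\eta(X)\big(\mathbf{1}[f^*(X) = 1] - \mathbf{1}[f_M(X) = 1]\big) \\
&\qquad\qquad + (1 - \eta(X))\big(\mathbf{1}[f^*(X) = 0] - \mathbf{1}[f_M(X) = 0]\big)\big] \\
&\quad= \mathbb{E}_{X \sim \mathcal{P}|_X}\big[(2\eta(X) - 1)\big(\mathbf{1}[f^*(X) = 1] - \mathbf{1}[f_M(X) = 1]\big)\big] \\
&\quad= \mathbb{E}_{X \sim \mathcal{P}|_X}\big[2\,|\eta(X) - 1/2| \cdot \mathbf{1}[f_M(X) \ne f^*(X)]\big] \\
&\quad= \frac{2\alpha}{(D + 1)^2} \sum_{i > j} \mathbf{1}\big[f_M((x_i, x_j)) \ne f^*((x_i, x_j))\big], \qquad (8)
\end{aligned}
$$

where (i) the second to last equality is by noting that $f^*(X) \ne 1 \iff \eta(X) \le 1/2$, and (ii)
the last equality is by noting the definition of $\eta(X)$, $f_M((x_i, x_i)) = f^*((x_i, x_i)) = 1$ for all $i$, and
$f((x_i, x_j)) = f((x_j, x_i))$ for all $f$. For notational simplicity, we shall define $X_{i,j} := (x_i, x_j)$.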

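Now, for a given paired sample $S_m$, let $N(S_m) := (N_i)_i$ (for all $0 \le i \le D$), where $N_i$ is the
number of occurrences of the point $x_i$ in $S_m$. Then for any $f_M$,

$$
\begin{aligned}
&\mathbb{E}_{S_m}\Big[\frac{1}{(D + 1)^2} \sum_{i > j} \mathbf{1}[f_M(X_{i,j}) \ne f^*(X_{i,j})]\Big] \\
&\quad= \frac{1}{(D + 1)^2} \sum_{i > j} P_{S_m}[f_M(X_{i,j}) \ne f^*(X_{i,j})] \\
&\quad= \frac{1}{(D + 1)^2} \sum_{i > j} \sum_{N \in \mathbb{N}^{D+1}} P_{S_m}[f_M(X_{i,j}) \ne f^*(X_{i,j}) \mid N(S_m) = N] \cdot P[N(S_m) = N] \\
&\quad= \frac{1}{(D + 1)^2} \sum_{N \in \mathbb{N}^{D+1}} P[N(S_m) = N] \cdot \sum_{i > j} P_{S_m}[f_M(X_{i,j}) \ne f^*(X_{i,j}) \mid N_i, N_j] \\
&\quad\ge \frac{1}{(D + 1)^2} \sum_{N \in \mathbb{N}^{D+1}} P[N(S_m) = N] \cdot \sum_{i > j} \frac{1}{4}\Bigg(1 - \sqrt{1 - \exp\Big(\frac{-(\max\{N_i, N_j\} + 1)\alpha^2}{1 - \alpha^2}\Big)}\Bigg) \\
&\quad\ge \frac{1}{4} \frac{D}{D + 1} \Bigg(1 - \sqrt{1 - \exp\Big(\frac{-((2m/(D + 1)) + 1)\alpha^2}{1 - \alpha^2}\Big)}\Bigg) \\
&\quad\ge \frac{1}{8} \Bigg(1 - \sqrt{1 - \exp\Big(\frac{-((2m/(D + 1)) + 1)\alpha^2}{1 - \alpha^2}\Big)}\Bigg),
\end{aligned}
$$

where (i) the first inequality is by applying Lemma 11, (ii) the second inequality is by assuming
WLOG $N_i \ge N_j$, and noting that the expression above is convex in $N_i$ so one can apply Jensen's
inequality and by observing that $\mathbb{E}[N_i] = 2m/(D + 1)$ and that there are total $D(D + 1)$ summands
for $i > j$, and (iii) the last inequality is by noting that $D \ge 1$. Now, let $B$ denote the r.h.s. quantity
above. Then by recalling that for any $[0, 1]$-valued random variable $Z$, $P(Z > \gamma) > \mathbb{E}Z - \gamma$ (for all
$0 < \gamma < 1$), we have

$$P_{S_m}\Big[\frac{1}{(D + 1)^2} \sum_{i > j} \mathbf{1}\big[f_M((x_i, x_j)) \ne f^*((x_i, x_j))\big] > \gamma B\Big] > (1 - \gamma) B.$$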
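Or equivalently, by combining the equations above, we have

$$\mathbb{E}_{\mathcal{D} \sim \mathrm{unif}(\mathfrak{D})}\, P_{S_m \sim \mathcal{D}}\Big[\mathrm{err}^\lambda_{\mathrm{dist}}(\mathcal{A}(S_m), \mathcal{D}) - \mathrm{err}^\lambda_{\mathrm{dist}}(M^*_{\mathcal{D}}, \mathcal{D}) > 2\alpha\gamma B\Big] > (1 - \gamma) B,$$

where $M^*_{\mathcal{D}} := \operatorname{argmin}_{M \in \mathcal{M}} \mathrm{err}^\lambda_{\mathrm{dist}}(M, \mathcal{D})$ and $\mathcal{A}(S_m)$ is any metric returned by an empirical error
minimizing algorithm. Now, if (cond. 1) $B \ge \delta/(1 - \gamma)$ and (cond. 2) $\epsilon \le 2\gamma\alpha B$, it follows that for
some $\mathcal{D} \in \mathfrak{D}$

$$P_{S_m \sim \mathcal{D}}\Big[\mathrm{err}^\lambda_{\mathrm{dist}}(\mathcal{A}(S_m), \mathcal{D}) - \mathrm{err}^\lambda_{\mathrm{dist}}(M^*_{\mathcal{D}}, \mathcal{D}) > \epsilon\Big] > \delta. \qquad (9)$$

Now, to satisfy cond. 1 & 2, we shall select $\gamma = 1 - 16\delta$. Then cond. 1 follows if

$$m \le \frac{D + 1}{2} \Big( \frac{1 - \alpha^2}{\alpha^2} \ln(4/3) - 1 \Big).$$

Choosing parameter $\alpha = 8\epsilon/\gamma$ (and by noting $B \ge 1/16$ by cond. 1 for choice of $\gamma$ and $m$), cond.
2 is satisfied as well. Hence,

$$m \le \frac{D + 1}{2} \Big( \frac{(1 - 16\delta)^2 - (8\epsilon)^2}{64\epsilon^2} \ln(4/3) - 1 \Big)$$

implies Eq. (9). Moreover, if $0 < \epsilon, \delta < 1/64$ then a simpler bound on $m$ of order $(D + 1)/\epsilon^2$ would suffice. ∎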
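
To get a feel for the threshold on $m$ derived in this proof, the expression can be evaluated directly. The sketch below is illustrative only and assumes the threshold has the form $\frac{D+1}{2}\big(\frac{1-\alpha^2}{\alpha^2}\ln(4/3) - 1\big)$ with $\gamma = 1 - 16\delta$ and $\alpha = 8\epsilon/\gamma$; the function name is hypothetical.

```python
import math

def lower_bound_sample_threshold(D, eps, delta):
    """Sample size below which the construction in the proof forces failure.

    Assumes gamma = 1 - 16*delta and alpha = 8*eps/gamma, as chosen in the proof.
    """
    gamma = 1.0 - 16.0 * delta
    alpha = 8.0 * eps / gamma
    if not (0.0 < alpha < 1.0):
        raise ValueError("need 0 < 8*eps/(1 - 16*delta) < 1")
    # (1 - alpha^2) / alpha^2 equals ((1 - 16 delta)^2 - (8 eps)^2) / (64 eps^2).
    ratio = (1.0 - alpha**2) / alpha**2
    return 0.5 * (D + 1) * (ratio * math.log(4.0 / 3.0) - 1.0)

threshold = lower_bound_sample_threshold(D=100, eps=0.01, delta=0.01)
```

The threshold is linear in $D + 1$ and grows roughly like $1/\epsilon^2$, which is the scaling the lemma asserts.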


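Definition 1 Define $n + 1$ vectors $\Delta_n = \{v_0, \ldots, v_n\}$, with each $v_i \in \mathbb{R}^n$ as

$$v_{0,j} = \frac{-1}{\sqrt{n}} \qquad \text{for } 1 \le j \le n,$$

$$v_{i,j} = \begin{cases} \dfrac{(n - 1)\sqrt{n + 1} + 1}{n\sqrt{n}} & \text{if } i = j \\[2ex] \dfrac{-(\sqrt{n + 1} - 1)}{n\sqrt{n}} & \text{otherwise} \end{cases} \qquad \text{for } 1 \le i, j \le n.$$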


Fact 9 [properties of vertices of a regular $n$-simplex] Let $\Delta_n = \{v_0, \ldots, v_n\}$ be a set of $n + 1$
vectors in $\mathbb{R}^n$ as per Definition 1. Then, $\Delta_n$ defines vertices of a regular $n$-simplex circumscribed
in a unit $(n - 1)$-sphere, with

(i) $\|v_i\|^2 = 1$ (for all $i$), and

(ii) $\|v_i - v_j\|^2 = 2(n + 1)/n$ (for $i \ne j$).

Moreover, for any non-empty bi-partition of $\Delta_n$ into $\Delta_n^{(1)}$ and $\Delta_n^{(2)}$ with $|\Delta_n^{(1)}| = k$ and
$|\Delta_n^{(2)}| = n + 1 - k$, define $a^{(1)}$ and $a^{(2)}$ the means (centroids) of the points in $\Delta_n^{(1)}$ and $\Delta_n^{(2)}$ respectively.
Then, we also have

(i) $(a^{(1)} - a^{(2)}) \cdot (a^{(i)} - v_j) = 0$ (for $i \in \{1, 2\}$, and $v_j \in \Delta_n^{(i)}$).

(ii) $\|a^{(1)} - a^{(2)}\|^2 = \frac{(n + 1)^2}{k\,n\,(n + 1 - k)} \ge \frac{4}{n}$, for $1 \le k \le n$.
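
These properties are easy to confirm numerically. The sketch below is an illustration, not the authors' construction verbatim: it assumes one standard coordinate choice for a regular simplex inscribed in the unit sphere (first vertex $-\mathbf{1}/\sqrt{n}$, remaining vertices $\sqrt{(n+1)/n}\,e_i - \frac{\sqrt{n+1}-1}{n\sqrt{n}}\mathbf{1}$), and the function name is hypothetical.

```python
import numpy as np

def regular_simplex(n):
    """Rows are n + 1 vertices of a regular n-simplex on the unit sphere in R^n.

    Assumed coordinates: one standard construction, chosen for illustration.
    """
    ones = np.ones(n)
    shift = (np.sqrt(n + 1.0) - 1.0) / (n * np.sqrt(n))
    first = -ones / np.sqrt(n)
    rest = np.sqrt((n + 1.0) / n) * np.eye(n) - shift * ones
    return np.vstack([first, rest])

n, k = 6, 2
V = regular_simplex(n)
sq_norms = (V ** 2).sum(axis=1)
sq_dists = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)

# Bi-partition: the first k vertices versus the remaining n + 1 - k.
a1, a2 = V[:k].mean(axis=0), V[k:].mean(axis=0)
centroid_gap = ((a1 - a2) ** 2).sum()
```

The centroid gap is smallest for balanced partitions, which is where the $4/n$ lower bound is (nearly) attained.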

Lemma 10 Let $\Delta_D$ be a set of $D + 1$ points $\{X_0, \ldots, X_D\}$ in $\mathbb{R}^D$ as per Definition 1, and let
$\mathcal{D}$ be an arbitrary distribution over $\Delta_D \times \{0, 1\}$. Define $P_i := \mathbf{1}[P_{\mathcal{D}}[(X_i, 1)] > 1/2]$. Define
$\Pi := \{\pi : \Delta_D \to \mathbb{R}^D\}$ to be the collection of all functions that map points in $\Delta_D$ to arbitrary points
in $\mathbb{R}^D$. Define

$$f((x, y), (x', y'); \pi) := \begin{cases} \min\{1, \frac{D}{4}\|\pi(x) - \pi(x')\|^2\} & \text{if } y = y' \\ \min\{1, [1 - \frac{D}{4}\|\pi(x) - \pi(x')\|^2]_+\} & \text{if } y \ne y' \end{cases}.$$

Let $\mathcal{E}(\pi) := \mathbb{E}_{(x, y), (x', y') \sim \mathcal{D} \times \mathcal{D}}[f((x, y), (x', y'); \pi)]$ and $\mathcal{E}^* := \inf_{\pi \in \Pi} \mathcal{E}(\pi)$. Then, for any $\pi \in \Pi$
such that

(i) $\pi(X_i) = \pi(X_j)$, if $P_i = P_j$

(ii) $\|\pi(X_i) - \pi(X_j)\|^2 \ge 4/D$, if $P_i \ne P_j$,

we have that $\mathcal{E}(\pi) = \mathcal{E}^*$. Moreover, define $A$ as

• $A := \frac{A_1 - A_0}{\|A_1 - A_0\|}$, where $A_0 := \mathrm{mean}(X_i)$ such that $P_i = 0$, and $A_1 := \mathrm{mean}(X_i)$ such that
$P_i = 1$ (if there exists at least one $P_i = 0$ and at least one $P_i = 1$).

• $A := 0$, i.e. the zero vector in $\mathbb{R}^D$ (otherwise).

And let $M$ be a $D \times D$ matrix (with $\sigma_{\max}(M) = 1$) defined as

$$M := A A^T.$$

Then the map $\pi_M : x \mapsto Mx$ constitutes a map that satisfies conditions (i) and (ii) and thus
$\mathcal{E}(\pi_M) = \mathcal{E}^*$.

Proof The proof follows from the geometric properties of $\Delta_D$ and Fact 9. ∎
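
The rank-one map in Lemma 10 can be checked on a concrete labeling. This is a hedged sketch: the simplex coordinates are an assumed standard construction, the threshold $4/D$ for condition (ii) is taken from the definition of the restricted metric class used in the proof of Lemma 2, and `projection_metric` is a hypothetical helper name.

```python
import numpy as np

def regular_simplex(n):
    """Rows are n + 1 vertices of a regular n-simplex on the unit sphere (assumed coordinates)."""
    ones = np.ones(n)
    shift = (np.sqrt(n + 1.0) - 1.0) / (n * np.sqrt(n))
    return np.vstack([-ones / np.sqrt(n),
                      np.sqrt((n + 1.0) / n) * np.eye(n) - shift * ones])

def projection_metric(V, P):
    """M = A A^T, with A the unit vector joining the two class centroids (zero if one class is empty)."""
    P = np.asarray(P)
    if P.min() == P.max():
        return np.zeros((V.shape[1], V.shape[1]))
    direction = V[P == 1].mean(axis=0) - V[P == 0].mean(axis=0)
    A = direction / np.linalg.norm(direction)
    return np.outer(A, A)

D = 7
V = regular_simplex(D)
rng = np.random.default_rng(1)
P = rng.integers(0, 2, size=D + 1)
P[0], P[1] = 0, 1                      # make sure both labels occur
M = projection_metric(V, P)
images = V @ M.T                       # row i is M x_i
```

All vertices sharing a label collapse to a single image point, so the induced classifier $(x_i, x_j) \mapsto \mathbf{1}[Mx_i = Mx_j]$ depends only on the labels.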

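Lemma 11 Given two random variables $\alpha_1$ and $\alpha_2$, each uniformly distributed on $\{\alpha_-, \alpha_+\}$
independently, where $\alpha_- = 1/2 - \epsilon/2$ and $\alpha_+ = 1/2 + \epsilon/2$ with $0 < \epsilon < 1$. Suppose that $\xi^1_1, \ldots, \xi^1_m$
and $\xi^2_1, \ldots, \xi^2_m$ are two i.i.d. sequences of $\{0, 1\}$-valued random variables with $P(\xi^1_i = 1) = \alpha_1$
and $P(\xi^2_i = 1) = \alpha_2$ for all $i$. Then, for any likelihood maximizing function $f$ from $\{0, 1\}^m$ to
$\{\alpha_-, \alpha_+\}$ that estimates the bias $\alpha_1$ and $\alpha_2$ from the samples,

$$
P\Big[\big(f(\xi^1_1, \ldots, \xi^1_m) \ne \alpha_1 \text{ and } f(\xi^2_1, \ldots, \xi^2_m) = \alpha_2\big) \text{ or } \big(f(\xi^1_1, \ldots, \xi^1_m) = \alpha_1 \text{ and } f(\xi^2_1, \ldots, \xi^2_m) \ne \alpha_2\big)\Big]
> \frac{1}{4}\Bigg(1 - \sqrt{1 - \exp\Big(\frac{-2\lceil m/2 \rceil \epsilon^2}{1 - \epsilon^2}\Big)}\Bigg).
$$

Proof Note that

$$
\begin{aligned}
&P\Big[\big(f(\xi^1_1, \ldots, \xi^1_m) \ne \alpha_1 \text{ and } f(\xi^2_1, \ldots, \xi^2_m) = \alpha_2\big) \text{ or } \big(f(\xi^1_1, \ldots, \xi^1_m) = \alpha_1 \text{ and } f(\xi^2_1, \ldots, \xi^2_m) \ne \alpha_2\big)\Big] \\
&\quad= P[f(\xi^1_1, \ldots, \xi^1_m) \ne \alpha_1] \cdot P[f(\xi^2_1, \ldots, \xi^2_m) = \alpha_2] + P[f(\xi^1_1, \ldots, \xi^1_m) = \alpha_1] \cdot P[f(\xi^2_1, \ldots, \xi^2_m) \ne \alpha_2] \\
&\quad\ge \frac{1}{2} P[f(\xi^1_1, \ldots, \xi^1_m) \ne \alpha_1] + \frac{1}{2} P[f(\xi^2_1, \ldots, \xi^2_m) \ne \alpha_2] \\
&\quad> \frac{1}{4}\Bigg(1 - \sqrt{1 - \exp\Big(\frac{-2\lceil m/2 \rceil \epsilon^2}{1 - \epsilon^2}\Big)}\Bigg),
\end{aligned}
$$

where the first inequality is by noting that a likelihood maximizing $f$ will select the correct bias
better than random (which has probability 1/2), and the second inequality is by applying Lemma
12. ∎
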
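Lemma 12 [Lemma 5.1 of Anthony & Bartlett (1999)] Suppose that $\alpha$ is a random variable
uniformly distributed on $\{\alpha_-, \alpha_+\}$, where $\alpha_- = 1/2 - \epsilon/2$ and $\alpha_+ = 1/2 + \epsilon/2$, with $0 < \epsilon < 1$.
Suppose that $\xi_1, \ldots, \xi_m$ are i.i.d. $\{0, 1\}$-valued random variables with $P(\xi_i = 1) = \alpha$ for all $i$. Let
$f$ be a function from $\{0, 1\}^m$ to $\{\alpha_-, \alpha_+\}$. Then

$$P\big[f(\xi_1, \ldots, \xi_m) \ne \alpha\big] > \frac{1}{4}\Bigg(1 - \sqrt{1 - \exp\Big(\frac{-2\lceil m/2 \rceil \epsilon^2}{1 - \epsilon^2}\Big)}\Bigg).$$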
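
A quick simulation illustrates the coin-bias lower bound. The sketch assumes the bound has the form $\frac{1}{4}\big(1 - \sqrt{1 - \exp(-2\lceil m/2\rceil \epsilon^2/(1-\epsilon^2))}\big)$ and uses majority vote as the likelihood-maximizing guess (with ties sent to the lower bias, an arbitrary choice); function names and parameters are hypothetical.

```python
import math
import numpy as np

def bias_error_lower_bound(m, eps):
    """Assumed form of the bound: (1/4)(1 - sqrt(1 - exp(-2*ceil(m/2)*eps^2/(1 - eps^2))))."""
    exponent = -2.0 * math.ceil(m / 2) * eps**2 / (1.0 - eps**2)
    return 0.25 * (1.0 - math.sqrt(1.0 - math.exp(exponent)))

def simulate_ml_error(m, eps, n_trials, rng):
    """Empirical error rate of the majority-vote (maximum likelihood) bias guess."""
    lo, hi = 0.5 - eps / 2, 0.5 + eps / 2
    truth_is_hi = rng.random(n_trials) < 0.5             # alpha uniform on {lo, hi}
    alpha = np.where(truth_is_hi, hi, lo)
    heads = rng.binomial(m, alpha)                       # number of ones among m flips
    guess_is_hi = heads > m / 2                          # ties go to the low bias
    return float(np.mean(guess_is_hi != truth_is_hi))

rng = np.random.default_rng(0)
m, eps = 20, 0.2
empirical = simulate_ml_error(m, eps, n_trials=20000, rng=rng)
bound = bias_error_lower_bound(m, eps)
```

Even the best estimator errs with probability above the bound, which is what drives the sample complexity lower bound in the proof of Lemma 2.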





A.3 Proof of Lemma 3

For any $M \in \mathcal{M}$ define real-valued hypothesis class on domain $X$ as $\mathcal{H}_M := \{x \mapsto h(Mx) : h \in \mathcal{H}\}$
and define

$$\mathcal{F} := \{x \mapsto h(Mx) : M \in \mathcal{M}, h \in \mathcal{H}\} = \bigcup_{M} \mathcal{H}_M.$$

Observe that a uniform convergence of errors induced by the functions in $\mathcal{F}$ implies convergence
of the class of weighted matrices as well.

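Now for any domain $X$, real-valued hypothesis class $\mathcal{G} \subset [0, 1]^X$, margin $\gamma > 0$, and a sample
$S \subset X$, define

$$\mathrm{cov}_\gamma(\mathcal{G}, S) := \Big\{ C \subset \mathcal{G} \;\Big|\; \forall g \in \mathcal{G}, \exists g' \in C, \; \max_{s \in S} |g(s) - g'(s)| \le \gamma \Big\}$$

as the set of $\gamma$-covers of $S$ by $\mathcal{G}$. Let the $\gamma$-covering number of $\mathcal{G}$ for any integer $m > 0$ be defined as

$$\mathcal{N}_\infty(\gamma, \mathcal{G}, m) := \max_{S \subset X : |S| = m} \; \min_{C \in \mathrm{cov}_\gamma(\mathcal{G}, S)} |C|,$$

with the minimizing cover $C$ called the minimizing $(\gamma, m)$-cover of $\mathcal{G}$.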


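Now, for the given $\gamma$, we will first estimate the $\gamma$-covering number of $\mathcal{F}$, that is, $\mathcal{N}_\infty(\gamma, \mathcal{F}, m)$.
For any $M \in \mathcal{M}$, let $\tilde{\mathcal{H}}_M$ be the minimizing $(\gamma/2, m)$-cover of $\mathcal{H}_M$. Note that $|\tilde{\mathcal{H}}_M| =
\mathcal{N}_\infty(\gamma/2, \mathcal{H}_M, m) \le \mathcal{N}_\infty(\gamma/2, \mathcal{H}, m)$ (because $MX \subset X$).

Now let $\mathcal{M}_\epsilon$ be an $\epsilon$-spectral cover of $\mathcal{M}$ (that is, for every $M \in \mathcal{M}$, there exists $M' \in \mathcal{M}_\epsilon$ such that
$\sigma_{\max}(M - M') \le \epsilon$), and define

$$\mathcal{F}_\epsilon := \{x \mapsto h(Mx) : M \in \mathcal{M}_\epsilon, h \in \tilde{\mathcal{H}}_M\}.$$

Note that $|\mathcal{F}_\epsilon| \le |\mathcal{M}_\epsilon||\tilde{\mathcal{H}}_M| \le \mathcal{N}_\infty(\gamma/2, \mathcal{H}, m)(1 + 2D/\epsilon)^{D^2}$ (c.f. Lemma 13). Observe that $\mathcal{F}_\epsilon$
is a $(\gamma/2 + \lambda\epsilon B)$-cover of $\mathcal{F}$, since (i) for any $f \in \mathcal{F}$ (formed by combining, say, $M_0 \in \mathcal{M}$ and
$h_0 \in \mathcal{H}$), there exists $\tilde{f} \in \mathcal{F}_\epsilon$, namely the $\tilde{f}$ formed by $\tilde{M}_0$ such that $\sigma_{\max}(M_0 - \tilde{M}_0) \le \epsilon$, and (ii)
$\tilde{h}_0 \in \tilde{\mathcal{H}}_{\tilde{M}_0}$ such that $|h_0(\tilde{M}_0 x) - \tilde{h}_0(\tilde{M}_0 x)| \le \gamma/2$ (for all $x \in X$). So, (for any $x \in X$)

$$
\begin{aligned}
|f(x) - \tilde{f}(x)| &= |h_0(M_0 x) - \tilde{h}_0(\tilde{M}_0 x)| \\
&\le |h_0(M_0 x) - h_0(\tilde{M}_0 x)| + |h_0(\tilde{M}_0 x) - \tilde{h}_0(\tilde{M}_0 x)| \\
&\le \lambda \|M_0 x - \tilde{M}_0 x\| + \gamma/2 \\
&\le \lambda \sigma_{\max}(M_0 - \tilde{M}_0)\|x\| + \gamma/2 \\
&\le \lambda \epsilon B + \gamma/2.
\end{aligned}
$$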
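So, if we pick $\epsilon$ small enough that $\lambda \epsilon B \le \gamma/2$ (for instance any $\epsilon \le \frac{\gamma}{2\lambda B}$), it follows that

$$\mathcal{N}_\infty(\gamma, \mathcal{F}, m) \le |\mathcal{F}_\epsilon| \le \mathcal{N}_\infty(\gamma/2, \mathcal{H}, m)(1 + 2D/\epsilon)^{D^2}.$$

By noting Lemmas 14 and 15, it follows that for any deviation level $\alpha > 0$ the failure probability

$$P_{S_m \sim \mathcal{D}}\Big[\exists f \in \mathcal{F} : \mathrm{err}(f) \ge \mathrm{err}_\gamma(f, S_m) + \alpha\Big] \le 2\,\mathcal{N}_\infty(\gamma/2, \mathcal{F}, 2m)\, e^{-\alpha^2 m/8},$$

where the covering number on the right is controlled, via the covering bound above (applied at scale
$\gamma/2$ with $2m$ points) and Lemma 14, in terms of $D^2$ and $\mathrm{Fat}_{\gamma/16}(\mathcal{H})$.

The lemma follows by bounding this failure probability with at most $\delta$. ∎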


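Lemma 13 [$\epsilon$-spectral coverings of $D \times D$ matrices] Let $\mathcal{M} := \{M \mid M \in \mathbb{R}^{D \times D}, \sigma_{\max}(M) =
1\}$ be the set of matrices with unit spectral norm. Define $\mathcal{M}_\epsilon$ as the $\epsilon$-cover of $\mathcal{M}$, that is, for every
$M \in \mathcal{M}$, there exists $M' \in \mathcal{M}_\epsilon$ such that $\sigma_{\max}(M - M') \le \epsilon$. Then for all $\epsilon > 0$, there exists $\mathcal{M}_\epsilon$
such that $|\mathcal{M}_\epsilon| \le \big(1 + \frac{2D}{\epsilon}\big)^{D^2}$.

Proof Fix any $\epsilon > 0$ and let $\mathcal{V}_{\epsilon/D}$ be a minimal size $(\epsilon/D)$-cover of the Euclidean unit ball $B_D$ in
$\mathbb{R}^D$. That is, for any $v \in B_D$, there exists $v' \in \mathcal{V}_{\epsilon/D}$ such that $\|v - v'\| \le \epsilon/D$. Using standard volume
arguments (see e.g. proof of Lemma 5.2 of Vershynin (2010)), we know that $|\mathcal{V}_{\epsilon/D}| \le \big(1 + \frac{2D}{\epsilon}\big)^{D}$.
Define

$$\mathcal{M}_\epsilon := \big\{ M' \mid M' = [v'_1 \cdots v'_D] \in \mathbb{R}^{D \times D}, \; v'_i \in \mathcal{V}_{\epsilon/D} \big\}.$$

Then $\mathcal{M}_\epsilon$ constitutes an $\epsilon$-cover of $\mathcal{M}$, since for any $M = [v_1 \cdots v_D] \in \mathcal{M}$ there exists $M' =
[v'_1 \cdots v'_D] \in \mathcal{M}_\epsilon$, in particular $M'$ such that $\|v_i - v'_i\| \le \epsilon/D$ (for all $i$). Then

$$\sigma_{\max}(M - M') \le \|M - M'\|_F \le \sum_i \|v_i - v'_i\| \le \epsilon.$$

Without loss of generality we can assume that each $M' \in \mathcal{M}_\epsilon$ has $\sigma_{\max}(M') = 1$. Moreover, by
construction, $|\mathcal{M}_\epsilon| \le \big(1 + \frac{2D}{\epsilon}\big)^{D^2}$. ∎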
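
The chain $\sigma_{\max}(M - M') \le \|M - M'\|_F \le \sum_i \|v_i - v'_i\|$ used in the covering argument can be checked numerically. In the sketch below, the random column perturbation is a hypothetical stand-in for choosing nearby cover elements; it only enforces that each column moves by at most $\epsilon/D$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, eps = 6, 0.3

# A matrix with unit spectral norm, so every column lies in the unit ball.
M = rng.standard_normal((D, D))
M /= np.linalg.norm(M, 2)

# Hypothetical stand-in for picking cover elements: move each column by at most eps / D.
shift = rng.standard_normal((D, D))
shift *= (eps / D) * rng.uniform(0.0, 1.0, size=D) / np.linalg.norm(shift, axis=0)
M_prime = M + shift

spectral = np.linalg.norm(M - M_prime, 2)            # sigma_max(M - M')
frobenius = np.linalg.norm(M - M_prime, "fro")
column_sum = np.linalg.norm(M - M_prime, axis=0).sum()
```

Because the cover is built column by column, its size is the vector cover size raised to the power $D$, which is where the $D^2$ exponent comes from.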

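Lemma 14 [extension of Theorem 12.8 of Anthony & Bartlett (1999)] Let $\mathcal{H}$ be a set of real
functions from a domain $X$ to the interval $[0, 1]$. Let $\gamma > 0$. Then for all $m \ge 1$,

$$\mathcal{N}_\infty(\gamma, \mathcal{H}, m) < c_0 \Big(\frac{4m}{\gamma^2}\Big)^{\mathrm{Fat}_{\gamma/4}(\mathcal{H}) \ln(4em/\gamma)}$$

for some universal constant $c_0$.


Proof Theorem 12.8 of Anthony & Bartlett (1999) asserts this for $m \ge \mathrm{Fat}_{\gamma/4}(\mathcal{H}) \ge 1$ with
$c_0 = 2$. Now, if $1 \le m < \mathrm{Fat}_{\gamma/4}(\mathcal{H})$, for some universal constant $c'$, we have
$\mathcal{N}_\infty(\gamma, \mathcal{H}, m) \le (c'/\gamma)^m \le (c'/\gamma)^{\mathrm{Fat}_{\gamma/4}(\mathcal{H})}$. ∎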

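Lemma 15 [Theorem 10.1 of Anthony & Bartlett (1999)] Suppose that $\mathcal{H}$ is a set of real-valued
functions defined on domain $X$. Let $\mathcal{D}$ be any probability distribution on $Z = X \times \{0, 1\}$, $0 < \epsilon < 1$,
real $\gamma > 0$ and integer $m \ge 1$. Then,

$$P_{S_m \sim \mathcal{D}}\Big[\exists h \in \mathcal{H} : \mathrm{err}(h) \ge \mathrm{err}_\gamma(h, S_m) + \epsilon\Big] \le 2\,\mathcal{N}_\infty\Big(\frac{\gamma}{2}, \mathcal{H}, 2m\Big) e^{-\epsilon^2 m/8},$$

where $S_m$ is an i.i.d. sample of size $m$ from $\mathcal{D}$.
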

A.4 Proof of Lemma 4

For any fixed $0 < \gamma < 1/8$ and the given bounded class of distributions with bound $B \ge 1$,
consider a $(1/B)$-bi-Lipschitz base hypothesis class $\mathcal{H}$ that maps hypotheses from the domain $X$ to
$[1/2 - 4\gamma, 1/2 + 4\gamma]$, and define

$$\mathcal{F} := \{x \mapsto h(Mx) : M \in \mathcal{M}, h \in \mathcal{H}\}.$$

Note that finding $M$ that minimizes $\mathrm{err}_{\mathrm{hypoth}}$ is equivalent to finding $f$ that minimizes error on $\mathcal{F}$.
Using Lemma 19 we have for any $0 < \gamma < 1/2$, the sample complexity of $\mathcal{F}$ is (for all $0 < \epsilon, \delta <
1/64$)

$$m \ge \frac{\mathrm{Fat}_{2\gamma}(\pi_{4\gamma}(\mathcal{F}))}{320\epsilon^2}, \qquad (10)$$

where $\pi_{4\gamma}(\mathcal{F})$ is the $(4\gamma)$-squashed function class of $\mathcal{F}$ (see Definition 2 below). We lower bound
$\mathrm{Fat}_{2\gamma}(\pi_{4\gamma}(\mathcal{F}))$ in terms of the fat-shattering dimension of $\mathcal{H}$ to yield the lemma.

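To this end we shall first define the $(\gamma, m)$-covering and packing number of a generic real-valued
hypothesis class $\mathcal{G}$. For any domain $X$, real-valued hypothesis class $\mathcal{G} \subset [0, 1]^X$, margin $\gamma > 0$, and
a sample $S \subset X$, define

$$
\begin{aligned}
\mathrm{cov}_\gamma(\mathcal{G}, S) &:= \Big\{ C \subset \mathcal{G} \;\Big|\; \forall g \in \mathcal{G}, \exists g' \in C, \; \max_{s \in S} |g(s) - g'(s)| \le \gamma \Big\} \\
\mathrm{pak}_\gamma(\mathcal{G}, S) &:= \Big\{ P \subset \mathcal{G} \;\Big|\; \forall g \ne g' \in P, \; \max_{s \in S} |g(s) - g'(s)| \ge \gamma \Big\}
\end{aligned}
$$

as the set of $\gamma$-covers (resp. $\gamma$-packings) of $S$ by $\mathcal{G}$. Let the $\gamma$-covering number (resp. $\gamma$-packing number)
of $\mathcal{G}$ for any integer $m > 0$ be defined as

$$
\begin{aligned}
\mathcal{N}_\infty(\gamma, \mathcal{G}, m) &:= \max_{S \subset X : |S| = m} \; \min_{C \in \mathrm{cov}_\gamma(\mathcal{G}, S)} |C| \\
\mathcal{P}_\infty(\gamma, \mathcal{G}, m) &:= \max_{S \subset X : |S| = m} \; \max_{P \in \mathrm{pak}_\gamma(\mathcal{G}, S)} |P|
\end{aligned}
$$

with the minimizing cover $C$ (resp. maximizing packing $P$) called the minimizing $(\gamma, m)$-cover
(resp. maximizing $(\gamma, m)$-packing) of $\mathcal{G}$.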

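With these definitions, we have the following (for some universal constant $c_0$).

$$
\begin{aligned}
c_0 \Big(\frac{m}{16\gamma^2}\Big)^{\mathrm{Fat}_{2\gamma}(\pi_{4\gamma}(\mathcal{F})) \ln(em/2\gamma)}
&\ge \mathcal{N}_\infty(8\gamma, \pi_{4\gamma}(\mathcal{F}), m) && \text{[Lemma 14]} \\
&\ge \mathcal{P}_\infty(16\gamma, \pi_{4\gamma}(\mathcal{F}), m) && \text{[Lemma 17]} \\
&\ge \Big(\frac{1}{32\gamma}\Big)^{D^2} \mathcal{P}_\infty(48\gamma, \pi_{4\gamma}(\mathcal{H}), m) && \text{[see (*) below]} \\
&= \Big(\frac{1}{32\gamma}\Big)^{D^2} \mathcal{P}_\infty(48\gamma, \mathcal{H}, m) && \text{[by the choice of } \mathcal{H}\text{]} \\
&\ge \Big(\frac{1}{32\gamma}\Big)^{D^2} \mathcal{N}_\infty(48\gamma, \mathcal{H}, m) && \text{[Lemma 17]} \\
&\ge \Big(\frac{1}{32\gamma}\Big)^{D^2} e^{\mathrm{Fat}_{768\gamma}(\mathcal{H})/8} && \text{[Lemma 18]} \qquad (11)
\end{aligned}
$$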

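(*) We show that $\mathcal{P}_\infty(16\gamma, \pi_{4\gamma}(\mathcal{F}), m) \ge (1/32\gamma)^{D^2} \mathcal{P}_\infty(48\gamma, \pi_{4\gamma}(\mathcal{H}), m)$, by exhibiting a set
$\mathcal{S} \subset \pi_{4\gamma}(\mathcal{F})$ of size $(1/32\gamma)^{D^2} \mathcal{P}_\infty(48\gamma, \pi_{4\gamma}(\mathcal{H}), m)$ that is a $(16\gamma)$-packing of $\pi_{4\gamma}(\mathcal{F})$.

Let $\pi_{4\gamma}(\mathcal{H}_{48\gamma}) \subset \pi_{4\gamma}(\mathcal{H})$ be a maximal $(48\gamma)$-packing of $\pi_{4\gamma}(\mathcal{H})$ (that is, a maximal set such
that for all distinct $(\pi_{4\gamma} \circ h), (\pi_{4\gamma} \circ h') \in \pi_{4\gamma}(\mathcal{H}_{48\gamma})$, there exists $x \in X$ such that
$|\pi_{4\gamma}(h(x)) - \pi_{4\gamma}(h'(x))| \ge 48\gamma$). Fix $\epsilon$ (exact value determined later), and define

$$\mathcal{S}_\epsilon := \Big\{ x \mapsto (\pi_{4\gamma} \circ h)(Mx) \;\Big|\; (\pi_{4\gamma} \circ h) \in \pi_{4\gamma}(\mathcal{H}_{48\gamma}), \; M \in \mathcal{M}_\epsilon \Big\},$$

where $\mathcal{M}_\epsilon$ is an $\epsilon$-spectral net of $\mathcal{M}$, that is, for all $M \in \mathcal{M}$, there exists $M' \in \mathcal{M}_\epsilon$ such that
$\sigma_{\max}(M - M') \le \epsilon$, and for all distinct $M', M'' \in \mathcal{M}_\epsilon$, $\sigma_{\max}(M' - M'') \ge \epsilon/2$.

Then for any two distinct $f, f' \in \mathcal{S}_\epsilon$, such that $f(x) = (\pi_{4\gamma} \circ h)(Mx)$ and $f'(x) = (\pi_{4\gamma} \circ
h')(M'x)$, we have

• (case 1) $h$ and $h'$ are distinct. In this case, there exists $x \in X$, s.t.

$$
\begin{aligned}
|f(x) - f'(x)| &= |\pi_{4\gamma}(h(Mx)) - \pi_{4\gamma}(h'(M'x))| \\
&\ge |\pi_{4\gamma}(h(Mx)) - \pi_{4\gamma}(h'(Mx))| - |\pi_{4\gamma}(h'(Mx)) - \pi_{4\gamma}(h'(M'x))| \\
&\ge 48\gamma - (1/B)\,\sigma_{\max}(M - M')\|x\| \\
&\ge 48\gamma - (1/B)\epsilon B = 48\gamma - \epsilon.
\end{aligned}
$$

• (case 2) $h$, $h'$ same but $M$ and $M'$ distinct. In this case, there exists $x$ (with $\|x\| = 1$) s.t.

$$
\begin{aligned}
|f(x) - f'(x)| &= |\pi_{4\gamma}(h(Mx)) - \pi_{4\gamma}(h(M'x))| \\
&= |h(Mx) - h(M'x)| \\
&\ge B\|(M - M')x\| \\
&\ge B \cdot \sigma_{\max}(M - M') \\
&\ge B(\epsilon/2).
\end{aligned}
$$

Thus, by setting $\epsilon = 32\gamma$, distinct classifiers $f, f' \in \mathcal{S}_{32\gamma}$ are at least $16\gamma$ apart (since $B \ge 1$).
Hence $\mathcal{S}_{32\gamma}$ forms a $(16\gamma)$-packing of $\pi_{4\gamma}(\mathcal{F})$. Therefore, the packing number

$$\mathcal{P}_\infty(16\gamma, \pi_{4\gamma}(\mathcal{F}), m) \ge |\mathcal{S}_{32\gamma}| = |\mathcal{M}_{32\gamma}||\pi_{4\gamma}(\mathcal{H}_{48\gamma})| \ge (1/32\gamma)^{D^2} \mathcal{P}_\infty(48\gamma, \pi_{4\gamma}(\mathcal{H}), m).$$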


Thus, from Eq. (11), it follows that

$$\mathrm{Fat}_{2\gamma}(\pi_{4\gamma}(\mathcal{F})) \ge \frac{D^2 \ln(1/\gamma) + \mathrm{Fat}_{768\gamma}(\mathcal{H})}{\ln(m/\gamma)}.$$

Combining this with Eq. (10), the lemma follows. ∎

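Lemma 16 [$\epsilon$-spectral packings of $D \times D$ matrices] Let $\mathcal{M} := \{M \mid M \in \mathbb{R}^{D \times D}, \sigma_{\max}(M) =
1\}$ be the set of matrices with unit spectral norm. Define $\mathcal{M}_\epsilon \subset \mathcal{M}$ as the $\epsilon$-packing of $\mathcal{M}$, that is,
for every distinct $M, M' \in \mathcal{M}_\epsilon$, $\sigma_{\max}(M - M') \ge \epsilon$. Then for all $\epsilon > 0$, there exists $\mathcal{M}_\epsilon$ such that
$|\mathcal{M}_\epsilon| \ge \big(\frac{1}{2\epsilon}\big)^{D^2}$.

Proof. Fix any $\epsilon > 0$ and let $\mathcal{V}_\epsilon$ be a maximal size $\epsilon$-packing of the Euclidean unit ball $B_D$ in $\mathbb{R}^D$.
That is, for all distinct $v, v' \in \mathcal{V}_\epsilon$, $\|v - v'\| \ge \epsilon$. Using standard volume arguments (see e.g. proof
of Lemma 5.2 of Vershynin (2010)), we know that $|\mathcal{V}_\epsilon| \ge \big(\frac{1}{2\epsilon}\big)^{D}$.

Define

$$\mathcal{M}_\epsilon := \big\{ M' \mid M' = [v'_1 \cdots v'_D] \in \mathbb{R}^{D \times D}, \; v'_i \in \mathcal{V}_\epsilon \big\}.$$

Then $\mathcal{M}_\epsilon$ constitutes an $\epsilon$-packing of $\mathcal{M}$, since for any distinct $M, M' \in \mathcal{M}_\epsilon$ such that $M =
[v_1 \cdots v_D]$ and $M' = [v'_1 \cdots v'_D]$, we have

$$\sigma_{\max}(M - M') \ge \max_i \|v_i - v'_i\| \ge \epsilon.$$

Without loss of generality we can assume that each $M \in \mathcal{M}_\epsilon$ has $\sigma_{\max}(M) = 1$. Moreover, by
construction, $|\mathcal{M}_\epsilon| \ge \big(\frac{1}{2\epsilon}\big)^{D^2}$. ∎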

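Lemma 17 [follows from Theorem 12.1 of Anthony & Bartlett (1999)] For any real valued
hypothesis class $\mathcal{H}$ into $[0, 1]$, all $m \ge 1$, and $0 < \gamma < 1/2$,

$$\mathcal{P}_\infty(2\gamma, \mathcal{H}, m) \le \mathcal{N}_\infty(\gamma, \mathcal{H}, m) \le \mathcal{P}_\infty(\gamma, \mathcal{H}, m).$$


Lemma 18 [Theorem 12.10 of Anthony & Bartlett (1999)] Let $\mathcal{H}$ be a set of real functions from
a domain $X$ to the interval $[0, 1]$. Let $\gamma > 0$. Then for $m \ge \mathrm{Fat}_{16\gamma}(\mathcal{H})$,

$$\mathcal{N}_\infty(\gamma, \mathcal{H}, m) \ge e^{\mathrm{Fat}_{16\gamma}(\mathcal{H})/8}.$$
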
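Lemma 19 [Theorem 13.5 of Anthony & Bartlett (1999)] Suppose that $\mathcal{H}$ is a set of real-valued
functions mapping into the interval $[0, 1]$ that is closed under addition of constants, that is,

$$h \in \mathcal{H} \implies h' \in \mathcal{H}, \text{ where } h' : x \mapsto h(x) + c \text{ for all } c.$$

Pick any $0 < \gamma < 1/2$. Then for any metric learning algorithm $\mathcal{A}$, for all $0 < \epsilon, \delta < 1/64$, there
exists a distribution $\mathcal{D}$ such that if $m \le \frac{d}{320\epsilon^2}$,

$$P_{S_m \sim \mathcal{D}}\big[\mathrm{err}(h^*, \mathcal{D}) > \mathrm{err}_\gamma(\mathcal{A}(S_m), \mathcal{D}) + \epsilon\big] > \delta,$$

where $d := \mathrm{Fat}_{2\gamma}(\pi_{4\gamma}(\mathcal{H})) \ge 1$ is the fat-shattering dimension of $\pi_{4\gamma}(\mathcal{H})$ (the $(4\gamma)$-squashed
function class of $\mathcal{H}$, see Definition 2 below) at margin $2\gamma$.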

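Definition 2 [squashing function] For any $0 < \gamma < 1/2$, define the squashing function
$\pi_\gamma : \mathbb{R} \to [1/2 - \gamma, 1/2 + \gamma]$ as

$$\pi_\gamma(\alpha) = \begin{cases} 1/2 + \gamma & \text{if } \alpha \ge 1/2 + \gamma \\ 1/2 - \gamma & \text{if } \alpha \le 1/2 - \gamma \\ \alpha & \text{otherwise} \end{cases}.$$

Moreover, for a collection $\mathcal{F}$ of functions into $\mathbb{R}$, define $\pi_\gamma(\mathcal{F}) := \{\pi_\gamma \circ f \mid f \in \mathcal{F}\}$.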
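
The squashing function is simply a clip to the interval $[1/2 - \gamma, 1/2 + \gamma]$. A minimal sketch follows; the names `squash` and `squash_class` are illustrative, and the two functions passed to `squash_class` are arbitrary examples.

```python
import numpy as np

def squash(a, gamma):
    """pi_gamma: clip values to the interval [1/2 - gamma, 1/2 + gamma]."""
    if not 0.0 < gamma < 0.5:
        raise ValueError("gamma must lie in (0, 1/2)")
    return np.clip(a, 0.5 - gamma, 0.5 + gamma)

def squash_class(functions, gamma):
    """pi_gamma(F): compose every function in the collection with the squashing map."""
    return [lambda x, f=f: squash(f(x), gamma) for f in functions]

values = squash(np.array([-1.0, 0.45, 0.5, 0.9]), gamma=0.1)
squashed = squash_class([np.sin, np.cos], gamma=0.25)
```

Values already inside the interval pass through unchanged, which is why squashing leaves a class that maps into $[1/2 - 4\gamma, 1/2 + 4\gamma]$ untouched at level $4\gamma$.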


A.5 Proof of Lemma 5

Let $\mathcal{P}$ be the probability measure induced by the random variable $(X, Y)$, where $X := (x, x')$,
$Y := \mathbf{1}[y = y']$, s.t. $((x, y), (x', y')) \sim (\mathcal{D} \times \mathcal{D})$.

Define function class

$$\mathcal{F} := \Big\{ f_M : X \mapsto \|M(x - x')\|^2 \;\Big|\; M \in \mathcal{M},\; X = (x, x') \in (\mathcal{X} \times \mathcal{X}) \Big\}.$$

Following the steps of the proof of Lemma 1 we can conclude that the Rademacher complexity of
$\mathcal{F}$ is bounded. In particular,

$$\mathcal{R}_m(\mathcal{F}) \le 4B^2 \sqrt{\frac{\sup_{M \in \mathcal{M}} \|M^T M\|_F^2}{m}}.$$

The result follows by noting that $\phi^\lambda$ is $\lambda$-Lipschitz in the first argument and by applying Lemma 8. ∎

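A.6 Proof of Lemma 6

Consider the function class

$$\mathcal{F} := \{ f_{v,M} : x \mapsto v \cdot Mx \mid \|v\|_1 \le 1, M \in \mathcal{M} \},$$

and define the composition class

$$\mathcal{F}_\sigma := \Big\{ x \mapsto \sum_{i=1}^{K} w_i \sigma^\gamma(f_i(x)) \;\Big|\; \|w\|_1 \le 1, \; f_1, \ldots, f_K \in \mathcal{F} \Big\}.$$

Then, first note that the Gaussian complexity of $\mathcal{F}$ (with respect to the distribution $\mathcal{D}$) is bounded,
since (let $g_1, \ldots, g_m$ denote independent standard Gaussian random variables)

$$
\begin{aligned}
\mathcal{G}_m(\mathcal{F}, \mathcal{D})
&:= \mathbb{E}_{x_i \sim \mathcal{D}|_X,\, g_i,\, i \in [m]} \Big[ \sup_{f_{v,M} \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} g_i f_{v,M}(x_i) \Big] \\
&= \frac{1}{m} \mathbb{E}_{x_i \sim \mathcal{D}|_X,\, g_i,\, i \in [m]} \Big[ \sup_{M \in \mathcal{M},\, \|v\|_1 \le 1} v \cdot \sum_{i=1}^{m} g_i (M x_i) \Big] \\
&= \frac{1}{m} \mathbb{E}_{x_i \sim \mathcal{D}|_X,\, g_i,\, i \in [m]} \Big[ \max_{j \in [D]} \sup_{M \in \mathcal{M}} \Big| \sum_{i=1}^{m} g_i (M x_i)_j \Big| \Big] \\
&\le \frac{1}{m} \mathbb{E}_{x_i \sim \mathcal{D}|_X,\, g_i,\, i \in [m]} \Big[ \max_{j \in [D]} \sum_{i=1}^{m} g_i \sup_{M \in \mathcal{M}} |(M x_i)_j| \Big] \\
&\le \frac{c \ln^{1/2}(D)}{m} \mathbb{E}_{x_i \sim \mathcal{D}|_X} \Big[ \max_{j, j' \in [D]} \Big( \mathbb{E}_{g_i} \Big( \sum_{i=1}^{m} g_i \Big( \sup_{M \in \mathcal{M}} |(M x_i)_j| - \sup_{M' \in \mathcal{M}} |(M' x_i)_{j'}| \Big) \Big)^2 \Big)^{1/2} \Big] \\
&= \frac{c \ln^{1/2}(D)}{m} \mathbb{E}_{x_i \sim \mathcal{D}|_X} \Big[ \max_{j, j' \in [D]} \Big( \sum_{i=1}^{m} \Big( \sup_{M \in \mathcal{M}} |(M x_i)_j| - \sup_{M' \in \mathcal{M}} |(M' x_i)_{j'}| \Big)^2 \Big)^{1/2} \Big] \\
&\le c' B \sqrt{\frac{d \ln D}{m}},
\end{aligned}
$$

where (i) the second to last inequality is by applying Lemma 20, (ii) $c, c'$ are absolute constants, and
(iii) $d$ is the supremum over $M \in \mathcal{M}$ of the norm-based quantity given in the lemma statement. Note
that bounding the Gaussian complexity also bounds the Rademacher complexity by Lemma 21.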

Finally by noting that $\mathcal{F}_\sigma$ is a $\gamma$-Lipschitz composition class of $\mathcal{F}$ and $\phi^\lambda$ is a classification based
loss function that is $\lambda$-Lipschitz in the first argument, we can apply Lemma 8, yielding the desired
result. ∎

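Lemma 20 [Lemma 20 of Bartlett & Mendelson (2002)] Let $Z_1, \ldots, Z_D$ be random variables
such that each $Z_j = \sum_{i=1}^{m} a_{ij} g_i$, where each $g_i$ is an independent $N(0, 1)$ random variable. Then
there is an absolute constant $c$ such that

$$\mathbb{E} \max_j Z_j \le c \ln^{1/2}(D) \max_{j, j'} \sqrt{\mathbb{E}(Z_j - Z_{j'})^2}.$$

Lemma 21 [Lemma 4 of Bartlett & Mendelson (2002)] There are absolute constants $c$ and $C$
such that for every class $\mathcal{F}$ and every integer $m$

$$c\,\mathcal{R}_m(\mathcal{F}, \mathcal{D}) \le \mathcal{G}_m(\mathcal{F}, \mathcal{D}) \le C \ln(m)\, \mathcal{R}_m(\mathcal{F}, \mathcal{D}),$$

where $\mathcal{R}$ and $\mathcal{G}$ are Rademacher and Gaussian complexities of a function class $\mathcal{F}$ with respect to the
distribution $\mathcal{D}$ respectively.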

A.7 Proof of Corollary 7

The first conclusion is immediate by dividing the given failure probability $\delta$ across the sequence
of classes $\mathcal{M}^1, \mathcal{M}^2, \ldots$ such that each class $\mathcal{M}^d$ is associated with its own share of the failure
probability, then applying Lemma 5 (for distance based metric learning) or Lemma 6 (for classifier based
metric learning) to each class individually, and finally combining the individual deviations together
with a union bound.

For the second part, for any $M \in \mathcal{M}$ define $d_M$ and $\Lambda_M$ as per the lemma statement. Then with
probability at least $1 - \delta$

$$
\begin{aligned}
\mathrm{err}^\lambda(M^{\mathrm{reg}}_m, \mathcal{D}) - \mathrm{err}^\lambda(M^*, \mathcal{D})
&\le \mathrm{err}^\lambda(M^{\mathrm{reg}}_m, S_m) + d_{M^{\mathrm{reg}}_m} \Lambda_{M^{\mathrm{reg}}_m} - \mathrm{err}^\lambda(M^*, \mathcal{D}) \\
&\le \mathrm{err}^\lambda(M^*, S_m) + d_{M^*} \Lambda_{M^*} - \mathrm{err}^\lambda(M^*, \mathcal{D}) \\
&\le O(d_{M^*} \Lambda_{M^*}) = O(\epsilon),
\end{aligned}
$$

where (i) the first inequality is by applying the first conclusion to the weighting metric $M^{\mathrm{reg}}_m$ (with
failure probability set to $\delta/2$), (ii) the second inequality is by noting that $M^{\mathrm{reg}}_m$ is the (regularized)
sample error minimizer as per the lemma statement, (iii) the third inequality is by applying the first
conclusion to the weighting metric $M^*$ (with failure probability set to $\delta/2$), and (iv) the last equality
is by noting the definitions of $d_{M^*}$, $\Lambda_{M^*}$ and our choice of $m$. ∎